Dataset Card for EUR-Lex SUM¶

The Danish subsection of EUR-lex SUM consisting of EU legislation paired with professionally written summaries.

EUR-Lex SUM is a dataset containing summaries of EU legislation from the EUR-Lex database. It consists of pairs of full legal texts and their corresponding professionally written summaries, covering European Union legal documents. The dataset is designed for training and evaluating automatic text summarization systems, particularly for legal documents. It's valuable for natural language processing (NLP) research since it provides high-quality, human-written summaries of complex legal texts in a specialized domain.

Dataset Description¶

Number of samples: 1.00K
Number of tokens (Llama 3): 31.37M
Average document length in tokens (min, max): 31.31K (2.14K, 1.72M)

Dataset Structure¶

An entry in the dataset consists of the following fields:

id (str): An unique identifier for each document.
text(str): The content of the document.
source (str): The source of the document.
added (str): An date for when the document was added to this collection.
created (str): An date range for when the document was originally created.
token_count (int): The number of tokens in the sample computed using the Llama 8B tokenizer

Dataset Card for EUR-Lex SUM¶

Dataset Description¶

Dataset Structure¶

Additional Processing¶

Dataset Statistics¶

Additional Information¶

License Information¶

Citation Information¶