Dataset Card for Nota lyd- og tekstdata (Tekst only)
The text-only part of the Nota lyd- og tekstdata dataset.
Nota lyd- og tekstdata (Tekst only) is a read-aloud dataset consisting of a few very long texts.
Dataset Description
- Number of samples: 446
- Number of tokens (Llama 3): 7.30M
- Average document length in tokens (min, max): 16.37K (4.48K, 107.26K)
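As a quick sanity check, the three figures above should agree with one another: the number of samples times the average document length should come out close to the reported total. The sketch below uses only the rounded values from this card, so a small rounding difference is expected; the variable names are illustrative.

```python
# Rounded figures copied from the list above.
num_samples = 446
mean_tokens = 16.37e3  # average document length in tokens
total_tokens = 7.30e6  # reported total number of tokens

# Samples times the mean length should approximate the total.
estimated_total = num_samples * mean_tokens
rel_diff = abs(estimated_total - total_tokens) / total_tokens

print(f"estimated total: {estimated_total / 1e6:.2f}M tokens")
print(f"relative difference from reported total: {rel_diff:.4%}")
```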
Dataset Structure
An example from the dataset looks as follows.
{
"id": "INSL20160004",
"text": "Inspiration nr. 4, 2016\nBiblioteksbetjening \nTelefon: 39 13 46 00\nEmail: biblioteket@nota.dk\nInspira[...]",
"source": "nota",
"added": "2025-02-03",
"created": "2016-01-01, 2016-12-31",
"token_count": 69977
}
Data Fields
An entry in the dataset consists of the following fields:
- id (str): A unique identifier for each document.
- text (str): The content of the document.
- source (str): The source of the document.
- added (str): The date on which the document was added to this collection.
- created (str): The date range in which the document was originally created.
- token_count (int): The number of tokens in the sample, computed using the Llama 3 8B tokenizer.
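The field descriptions above can be turned into a small check on a single record. The following is a minimal sketch, not part of the dataset tooling: the helper names `parse_created` and `validate_record` are hypothetical, and it assumes that `added` is an ISO date and that `created` is two ISO dates separated by a comma, as in the example record. The `text` value is shortened here for illustration.

```python
from datetime import date

# Field names and types as listed in the data fields.
EXPECTED_TYPES = {
    "id": str,
    "text": str,
    "source": str,
    "added": str,
    "created": str,
    "token_count": int,
}


def parse_created(created: str) -> tuple[date, date]:
    """Split a 'YYYY-MM-DD, YYYY-MM-DD' range into (start, end) dates."""
    start, end = (date.fromisoformat(part.strip()) for part in created.split(","))
    return start, end


def validate_record(record: dict) -> None:
    """Raise an exception if the record does not match the described fields."""
    for name, expected in EXPECTED_TYPES.items():
        if name not in record:
            raise KeyError(f"missing field: {name}")
        if not isinstance(record[name], expected):
            raise TypeError(f"{name} should be {expected.__name__}")
    date.fromisoformat(record["added"])  # raises ValueError if not an ISO date
    start, end = parse_created(record["created"])
    if start > end:
        raise ValueError("created range starts after it ends")


# The example record from this card, with the text shortened.
record = {
    "id": "INSL20160004",
    "text": "Inspiration nr. 4, 2016\nBiblioteksbetjening [...]",
    "source": "nota",
    "added": "2025-02-03",
    "created": "2016-01-01, 2016-12-31",
    "token_count": 69977,
}

validate_record(record)
start, end = parse_created(record["created"])
print(f"{record['id']} created between {start} and {end}")
```

Keeping `created` as a string in the record and parsing it only when needed mirrors how the field is stored in the example.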
Additional Processing
Dataset Statistics