Data Sources

The data a language model is trained on is decisive for what the model can be used for. In Danish Foundation Models (DFM), our approach is to make sure that data owners have permitted us to use the data we train on, and to focus on value-creating use cases. We pursue this, among other ways, through our collaboration with the Danish Language Model Consortium.

Current Data Sources

We work continuously to gather data from more sources. The table below lists the sources that, to the best of our knowledge, may currently be used to train a Danish language model. The data we have at present is not sufficient to train a Danish language model from scratch. Sizes are given in number of characters.

Dataset	Date	Domain	License	Size (characters)
AI aktindsigt	current	Municipal websites	CC0-1.0	408M
Domsdatabasen	1855–now	Court rulings	CC0-1.0	91.2M
Eur-lex-sum-da	1993–now	Law (EU)	CC-BY-SA 4.0	87.8M
FTSpeech	2017–now	Parliamentary speeches	Non-standard	244M
Scrape Hovedstaden	current	Health	CC0-1.0	79.9M
MeMo	1870–1899	Fiction	Public Domain	319M
Wikipedia	current	Encyclopaedia	CC-BY-SA 4.0	498M
Retsinformation.dk (*)	current	Legislation	Non-standard (*)	1.42G
Skat.dk (*)	current	Tax information	CC0-1.0	354M
H-Sø (*)	current	Court cases	CC0-1.0	204M
Hestenettet (*)	current	Forum	CC0-1.0	1.19G
Folketinget (*)	2009–2019	Debate	Non-standard	351M
Europarl (*)	2004–2008	Debate	CC0-1.0	312M
Spontaneous Speech (*)	2019	Conversations	CC0-1.0	4.0M
NAAT (*)	1930–now	Speeches	CC0-1.0	881k
Dansk Litteratur (*)	1700–now	Literature	CC0-1.0	162M
Gutenberg (*)	1700–now	Literature	Non-standard	19.2M
WikiBooks (*)	2019–2020	Manuals	CC0-1.0	17.5M
WikiSource (*)	1700–now	Literature	CC0-1.0	15.5M
Johannes V. Jensen (*)	–	JVJ's works	CC-BY-SA 4.0	10.7M
Religious Texts (*)	–	Religious	CC0-1.0	3.56M
TV2R (*)	2015–2019	News	CC-BY 4.0	64.04M
Dasem Data (*)	current	Other	Non-standard	4.45M
Botxt (*)	current	Bornholmish	CC0-1.0	2.01M
DDT (*)	current	Other	CC-BY-SA 4.0	546k
Sønderjysk (*)	current	South Jutlandic	CC0-1.0	140k

This list will be updated continuously as more data sources are added, in part through the collaboration with the Danish Language Model Consortium. Datasets marked with (*) in the table originate from Danish Gigaword.
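
The sizes in the table use k, M and G suffixes, so comparing or summing them requires converting each entry to a plain character count first. The following is a minimal sketch of that conversion, assuming the suffixes are decimal multipliers (10^3, 10^6, 10^9), which the table does not state explicitly. The helper names `parse_size` and `total_by_license` are hypothetical and only serve the illustration; the rows are a small sample copied from the table.

```python
from decimal import Decimal

# Assumption: k, M and G are decimal multipliers (10^3, 10^6, 10^9).
MULTIPLIERS = {"k": 10**3, "M": 10**6, "G": 10**9}


def parse_size(text: str) -> int:
    """Convert a size string such as '1.19G' into a character count."""
    text = text.strip()
    number, suffix = text[:-1], text[-1]
    if suffix not in MULTIPLIERS:
        raise ValueError(f"unknown size suffix in {text!r}")
    # Decimal avoids binary floating-point rounding for values like 1.19.
    return int(Decimal(number) * MULTIPLIERS[suffix])


# A small sample of rows copied from the table: (dataset, license, size).
ROWS = [
    ("Wikipedia", "CC-BY-SA 4.0", "498M"),
    ("Folketinget", "Non-standard", "351M"),
    ("Hestenettet", "CC0-1.0", "1.19G"),
    ("NAAT", "CC0-1.0", "881k"),
]


def total_by_license(rows):
    """Sum character counts per license for the given rows."""
    totals = {}
    for _name, license_name, size in rows:
        totals[license_name] = totals.get(license_name, 0) + parse_size(size)
    return totals


if __name__ == "__main__":
    for license_name, chars in sorted(total_by_license(ROWS).items()):
        print(f"{license_name}: {chars:,} characters")
```

Extending `ROWS` with the remaining table entries would give per-license totals for all listed sources; the sketch deliberately includes only four rows to stay short.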

Respect for Data Owners

We have the utmost respect for those who own data. We understand how important it is to protect and honour data owners' wishes regarding what their data may be used for. If you have any questions about the data we use, please do not hesitate to contact us. We are very open to dialogue and value your input, as it helps us improve our practices and ensure we meet data owners' expectations.