Data Sources

The data a language model is trained on determines what the model can be used for. In Danish Foundation Models (DFM), our approach is to make sure that data owners have permitted us to use the data we train on, and to focus on value-creating use cases. One of the ways we pursue this is through our collaboration with the Danish Language Model Consortium.

Current Data Sources

We work continuously to gather data from more sources. The table below lists the sources that, to the best of our knowledge, may currently be used to train a Danish language model. The data we have at present is not sufficient to train a Danish language model from scratch. Sizes are given as the number of characters.

| Dataset | Date | Domain | License | Size (characters) |
|---|---|---|---|---|
| AI aktindsigt | current | Municipal websites | CC0-1.0 | 408M |
| Domsdatabasen | 1855–now | Court rulings | CC0-1.0 | 91.2M |
| Eur-lex-sum-da | 1993–now | Law (EU) | CC-BY-SA 4.0 | 87.8M |
| FTSpeech | 2017–now | Parliamentary speeches | Non-standard | 244M |
| Scrape Hovedstaden | current | Health | CC0-1.0 | 79.9M |
| MeMo | 1870–1899 | Fiction | Public Domain | 319M |
| Wikipedia | current | Encyclopaedia | CC-BY-SA 4.0 | 498M |
| Retsinformation.dk (*) | current | Legislation | Non-standard (*) | 1.42G |
| Skat.dk (*) | current | Tax information | CC0-1.0 | 354M |
| H-Sø (*) | current | Court cases | CC0-1.0 | 204M |
| Hestenettet (*) | current | Forum | CC0-1.0 | 1.19G |
| Folketinget (*) | 2009–2019 | Debate | Non-standard | 351M |
| Europarl (*) | 2004–2008 | Debate | CC0-1.0 | 312M |
| Spontaneous Speech (*) | 2019 | Conversations | CC0-1.0 | 4.0M |
| NAAT (*) | 1930–now | Speeches | CC0-1.0 | 881k |
| Dansk Litteratur (*) | 1700–now | Literature | CC0-1.0 | 162M |
| Gutenberg (*) | 1700–now | Literature | Non-standard | 19.2M |
| WikiBooks (*) | 2019–2020 | Manuals | CC0-1.0 | 17.5M |
| WikiSource (*) | 1700–now | Literature | CC0-1.0 | 15.5M |
| Johannes V. Jensen (*) | | JVJ's works | CC-BY-SA 4.0 | 10.7M |
| Religious Texts (*) | | Religious | CC0-1.0 | 3.56M |
| TV2R (*) | 2015–2019 | News | CC-BY 4.0 | 64.04M |
| Dasem Data (*) | current | Other | Non-standard | 4.45M |
| Botxt (*) | current | Bornholmish | CC0-1.0 | 2.01M |
| DDT (*) | current | Other | CC-BY-SA 4.0 | 546k |
| Sønderjysk (*) | current | South Jutlandic | CC0-1.0 | 140k |
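
The sizes in the table are written with k, M and G suffixes. To compare or total them, they can be converted to plain character counts. The sketch below is illustrative only: it assumes the suffixes are decimal multipliers (thousand, million, billion), which the page does not state explicitly, and `parse_size` is a hypothetical helper name, not part of any DFM tooling.

```python
from decimal import Decimal

# Assumption: suffixes are decimal multipliers; the source does not define them.
SUFFIXES = {"k": 10**3, "M": 10**6, "G": 10**9}


def parse_size(text: str) -> int:
    """Convert a size string such as '1.42G' into a character count."""
    text = text.strip()
    if text and text[-1] in SUFFIXES:
        # Decimal avoids floating-point rounding on values like 1.42.
        return int(Decimal(text[:-1]) * SUFFIXES[text[-1]])
    return int(Decimal(text))


# A few (dataset, size) pairs copied from the table.
rows = [
    ("Wikipedia", "498M"),
    ("Retsinformation.dk", "1.42G"),
    ("NAAT", "881k"),
]

total = sum(parse_size(size) for _, size in rows)
print(f"{total:,} characters")
```

Extending `rows` to the full table would give an approximate total for all listed sources, subject to the rounding already present in the published figures.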

This list will be updated continuously as more data sources are added. Some of the data will come from the collaboration with the Danish Language Model Consortium. Datasets marked with (*) in the table originate from Danish Gigaword.

Respect for Data Owners

We have the utmost respect for those who own data. We understand how important it is to protect and honour data owners' wishes regarding what their data may be used for. If you have any questions about the data we use, please do not hesitate to contact us. We are very open to dialogue and value your input, as it helps us improve our practices and ensure we meet data owners' expectations.