Data Sources
The data a language model is trained on determines what the model can be used for. In Danish Foundation Models (DFM), our approach is to make sure that data owners have permitted us to use the data we train on, and to focus on value-creating use cases. We pursue this, among other ways, through our collaboration with the Danish Language Model Consortium.
Current Data Sources
We continuously work to gather data from more sources. The table below lists sources that, to the best of our knowledge, may currently be used to train a Danish language model. The amount of data we have at present is not sufficient to train a Danish language model from scratch. Sizes are given in number of characters.
| Dataset | Date | Domain | License | Size |
|---|---|---|---|---|
| AI aktindsigt | current | Municipal websites | CC0-1.0 | 408M |
| Domsdatabasen | 1855–now | Court rulings | CC0-1.0 | 91.2M |
| Eur-lex-sum-da | 1993–now | Law (EU) | CC-BY-SA 4.0 | 87.8M |
| FTSpeech | 2017–now | Parliamentary speeches | Non-standard | 244M |
| Scrape Hovedstaden | current | Health | CC0-1.0 | 79.9M |
| MeMo | 1870–1899 | Fiction | Public Domain | 319M |
| Wikipedia | current | Encyclopaedia | CC-BY-SA 4.0 | 498M |
| Retsinformation.dk (*) | current | Legislation | Non-standard (*) | 1.42G |
| Skat.dk (*) | current | Tax information | CC0-1.0 | 354M |
| H-Sø (*) | current | Court cases | CC0-1.0 | 204M |
| Hestenettet (*) | current | Forum | CC0-1.0 | 1.19G |
| Folketinget (*) | 2009–2019 | Debate | Non-standard | 351M |
| Europarl (*) | 2004–2008 | Debate | CC0-1.0 | 312M |
| Spontaneous Speech (*) | 2019 | Conversations | CC0-1.0 | 4.0M |
| NAAT (*) | 1930–now | Speeches | CC0-1.0 | 881k |
| Dansk Litteratur (*) | 1700–now | Literature | CC0-1.0 | 162M |
| Gutenberg (*) | 1700–now | Literature | Non-standard | 19.2M |
| WikiBooks (*) | 2019–2020 | Manuals | CC0-1.0 | 17.5M |
| WikiSource (*) | 1700–now | Literature | CC0-1.0 | 15.5M |
| Johannes V. Jensen (*) | – | JVJ's works | CC-BY-SA 4.0 | 10.7M |
| Religious Texts (*) | – | Religious | CC0-1.0 | 3.56M |
| TV2R (*) | 2015–2019 | News | CC-BY 4.0 | 64.04M |
| Dasem Data (*) | current | Other | Non-standard | 4.45M |
| Botxt (*) | current | Bornholmish | CC0-1.0 | 2.01M |
| DDT (*) | current | Other | CC-BY-SA 4.0 | 546k |
| Sønderjysk (*) | current | South Jutlandic | CC0-1.0 | 140k |
We will keep updating this list as more data sources become available, in part through the collaboration with the Danish Language Model Consortium. Datasets marked with (*) in the table originate from Danish Gigaword.
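The sizes in the table use the suffixes k, M and G. As a minimal sketch for turning them into character counts, the snippet below assumes these are decimal multipliers (10^3, 10^6 and 10^9), which this page does not state explicitly. The helper name `parse_size` is hypothetical and not part of any DFM tooling. The three rows are copied from the table purely as an illustration.

```python
from decimal import Decimal

# Assumption: k, M and G are decimal (SI) multipliers; the page does not define them.
MULTIPLIERS = {"k": 10**3, "M": 10**6, "G": 10**9}


def parse_size(text: str) -> int:
    """Convert a size string such as '1.42G' into a number of characters."""
    text = text.strip()
    suffix = text[-1]
    if suffix in MULTIPLIERS:
        # Decimal avoids binary floating-point rounding on values like 1.42.
        return int(Decimal(text[:-1]) * MULTIPLIERS[suffix])
    return int(Decimal(text))


# Three rows copied from the table, for illustration only.
sizes = {"Wikipedia": "498M", "Retsinformation.dk": "1.42G", "NAAT": "881k"}
total = sum(parse_size(size) for size in sizes.values())
print(f"{total:,} characters")  # prints "1,918,881,000 characters"
```

Because the table values are rounded to about three significant digits, any total computed this way is an approximation.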
Respect for Data Owners
We have the utmost respect for data owners, and we understand how important it is to protect and honour their wishes regarding what their data may be used for. If you have any questions about the data we use, please do not hesitate to contact us. We welcome dialogue and value your input, as it helps us improve our practices and meet data owners' expectations.