DCC v1
The DCC is a composite corpus consisting of the following subcorpora. For details on a specific subcorpus, see its individual datasheet.
Name | Description | Size | Open Access | Novel Corpus |
---|---|---|---|---|
Text | | | | |
DAGW | Danish Gigaword | 1B tokens | ✓ | ✗ |
reddit-da | Danish Reddit | <0.1B tokens | ✓ | ✗ |
HopeTwitter | Danish Tweets | 0.48B tokens | ✗ | ✓ |
DaNews | Danish newspapers | 0.5B tokens | ✗ | ✓ |
Netarkivet Text | Danish internet | >100B tokens | ✗ | ✓ |
Speech | | | | |
DaRadio | Danish talk radio | 140,000 hours | ✗ | ✓ |
DaTV | Danish subtitled TV | 900 hours | ✗ | ✓ |
Collaborators and Data Owners
The data is provided in agreement with the data owners and data collaborators. It is generally accessible to the research collaborators, though each data agreement has its own access restrictions and might not cover all research collaborators. Access restrictions are specified on the server hosting the data, in accordance with the data agreements.
- Data Owners
    - Newspapers / dailies (aviser / dagblade)
    - Danmarks Statistik
    - NetArkivet
- Data Collaborators
    - Det Kongelige Bibliotek
    - Infomedia
- Research Collaborators
    - Center for Humanities Computing, Aarhus Universitet
    - Alexandra Institutet
    - Peter Schneider-Kamp, Syddansk Universitet