DCC v1

The DCC is a composite corpus consisting of the following subcorpora. For more information about a specific subcorpus, see its individual datasheet.

Name	Description	Size	Open Access	Novel Corpus
Text
DAGW	Danish Gigaword	1B tokens	✓	✗
reddit-da	Danish Reddit	<0.1B tokens	✓	✗
HopeTwitter	Danish Tweets	0.48B tokens	✗	✓
DaNews	Danish newspapers	0.5B tokens	✗	✓
Netarkivet Text	Danish internet	>100B tokens	✗	✓
Speech
DaRadio	Danish talk radio	140,000 hours	✗	✓
DaTV	Danish subtitled TV	900 hours	✗	✓

Collaborators and Data Owners

Data are provided in agreement with the data owners and data collaborators. The data are generally accessible to the research collaborators, though each data agreement has its own access restrictions and might not cover all research collaborators. Access restrictions are specified on the server hosting the data, in accordance with the data agreements.

Data Owners
Aviser / dagblade
Danmarks Statistik
NetArkivet
Data Collaborators
Det Kongelige bibliotek
Infomedia
Research Collaborators
Center for Humanities Computing, Aarhus Universitet
Alexandra Institutet
Peter Schneider-Kamp, Syddansk Universitet
