
Dataset Card for Nordjylland News

Articles from the Danish newspaper TV2 Nord.

The data is derived from the Hugging Face dataset alexandrainst/nordjylland-news-summarization, which was originally intended for text summarization.

Dataset Description

  • Number of samples: 75.22K
  • Number of tokens (Llama 3): 37.90M
  • Average document length in tokens (min, max): 503.95 (29, 12.26K)

Dataset Structure

An example from the dataset looks as follows.

{
  "id": "nordjyllandnews_0",
  "text": "Lav et referat af nedenstående tekst:\n\nTekst:\nOpdatering: Manden er nu fundet af Nordjyllands Politi[...]",
  "source": "nordjyllandnews",
  "added": "2024-12-16",
  "created": "2000-01-01, 2024-01-01",
  "token_count": 628
}
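As a minimal sketch of how such a record might be consumed, the snippet below parses the example with Python's standard library and splits the `created` field into a start and end date. The helper name `parse_created` is hypothetical, and the assumption that `created` always holds two comma-separated ISO dates is based only on the single example shown here.

```python
import json
from datetime import date

# Record copied from the example above, with the text shortened for brevity.
raw = """{
  "id": "nordjyllandnews_0",
  "text": "Lav et referat af nedenstående tekst: [...]",
  "source": "nordjyllandnews",
  "added": "2024-12-16",
  "created": "2000-01-01, 2024-01-01",
  "token_count": 628
}"""


def parse_created(created: str) -> tuple[date, date]:
    """Split a 'start, end' string into two dates (hypothetical helper).

    Assumes exactly two comma-separated ISO dates, as in the example record.
    """
    start, end = (date.fromisoformat(part.strip()) for part in created.split(","))
    return start, end


record = json.loads(raw)
created_start, created_end = parse_created(record["created"])
added = date.fromisoformat(record["added"])
```

Note that the date range in this example spans more than two decades, so it appears to bound the creation date rather than pin it down.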

Data Fields

An entry in the dataset consists of the following fields:

  • id (str): A unique identifier for each document.
  • text (str): The content of the document.
  • source (str): The source of the document.
  • added (str): The date on which the document was added to this collection.
  • created (str): A date range for when the document was originally created.
  • token_count (int): The number of tokens in the sample, computed using the Llama 3 8B tokenizer.
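The field list above can be turned into a simple schema check. The following is an illustrative sketch only: `EXPECTED_FIELDS` and `validate_entry` are hypothetical names, not part of the dataset or of any library, and the check covers only field presence and Python type.

```python
# Field names and types as listed in this section.
EXPECTED_FIELDS = {
    "id": str,
    "text": str,
    "source": str,
    "added": str,
    "created": str,
    "token_count": int,
}


def validate_entry(entry: dict) -> list[str]:
    """Return a list of problems found in one entry (empty if none)."""
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in entry:
            problems.append(f"missing field: {name}")
        elif type(entry[name]) is not expected_type:
            # Exact type comparison, so a bool is not accepted as an int.
            problems.append(
                f"{name}: expected {expected_type.__name__}, "
                f"got {type(entry[name]).__name__}"
            )
    return problems


good = {
    "id": "nordjyllandnews_0",
    "text": "[...]",
    "source": "nordjyllandnews",
    "added": "2024-12-16",
    "created": "2000-01-01, 2024-01-01",
    "token_count": 628,
}
bad = {"id": "nordjyllandnews_1", "token_count": "628"}

good_problems = validate_entry(good)
bad_problems = validate_entry(bad)
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a malformed entry at once.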

Dataset Statistics

Additional Information

Opportunities for Improvement

An updated version of this data could be fetched from their API.

Sourced data

This dataset is derived from alexandrainst/nordjylland-news-summarization.

Citation Information

No citation is applicable for this work. We recommend citing the Hugging Face repository.