News

Bolette fights for the Danish language in the age of algorithms

Bolette Sandford Pedersen has been working with language models since 1989. Back then they were called machine translation systems, and artificial intelligence was not part of the equation. Today the field has exploded, and the professor from the University of Copenhagen has just been named to Denmark's Top 100 Women in AI. Meet the computational linguist from Danish Foundation Models (DFM) who refuses to let closed, foreign models define our Danish language community.

Denmark's Strategic Effort for Artificial Intelligence

The Danish Ministry of Digitalisation has published a national strategy for artificial intelligence — Strategisk indsats for kunstig intelligens — outlining Denmark's ambitions and priorities for AI development and adoption. The strategy identifies four key initiatives, and Danish Foundation Models is directly at the heart of the third.

The Imperative of Danish Foundation Models: Bridging the Linguistic AI Divide

In recent years, the field of machine learning has experienced a transformative shift, primarily driven by the advent of foundation models. These models, pre-trained on vast amounts of data, can be fine-tuned for various downstream tasks, making them invaluable across multiple domains. However, the dominance of English in the development of these models poses significant challenges for smaller language communities. The Danish Foundation Models project emerges as a crucial initiative to ensure that the Danish language does not lag behind in this AI revolution.

Data Handling

Training large language models requires enormous amounts of data. From the moment we receive raw data to the point it can be used for model training, it goes through a transformation process.

The following is a high-level description of this process. We continuously develop and improve it to ensure we apply state-of-the-art methods and practices.

Data Sources

The data a language model is trained on determines what the model can be used for. In Danish Foundation Models (DFM), our approach is to make sure that data owners have permitted us to use the data we train on, and to focus on value-creating use cases. We pursue this through, among other things, our collaboration with the Danish Language Model Consortium.

Releasing Munin 7B Alpha - A Danish LLM

We are excited to announce the release of the first model from the Danish Foundation Models project, nicknamed Munin 7B Alpha. This model represents the beginning of our research into Danish Large Language Models (LLMs), employing continual pre-training based on the already pre-trained Mistral-7b-v0.1 model. It has been pre-trained on the Danish Gigaword dataset, which has been instrumental in training various Danish BERT-style models.

Why Danish Needs Its Own Foundation Models

Danish is one of the world's richest languages — but in the age of large language models, it risks becoming a digital second-class citizen. We published a position paper arguing why that matters, and what we're doing about it.