Munin 1.0 Full Evaluation Results
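Scores are percentages, and standard errors (SE) are in percentage points. Each delta (Δ) is the Munin score minus the Original score, so a negative Δ means Munin scored lower than the original model. Danish deltas were computed from the reported scores, with standard errors propagated from the two reported SEs.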
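Danish datasets are grouped into tasks following the EuroEval Danish task taxonomy. Rows labelled "Average (n evals)" are unweighted means over the n evals in that task; their standard error is sqrt(sum(SE_i^2)) / n, which assumes the individual evaluation estimates are independent. For tasks with a single eval, the Average row repeats that eval's values.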
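As a sanity check on the aggregation rule, the following minimal sketch recomputes one Average row from its per-benchmark rows. The function name `aggregate` is hypothetical and is not taken from the Munin evaluation code; the inputs are the Apertus Original values for the three Danish common-sense reasoning evals in the table.

```python
import math


def aggregate(scores, ses):
    """Return the unweighted mean of `scores` and its standard error.

    The standard error follows the rule stated in the notes,
    sqrt(sum(SE_i^2)) / n, which assumes independent estimates.
    """
    n = len(scores)
    mean = sum(scores) / n
    se = math.sqrt(sum(s * s for s in ses)) / n
    return mean, se


# Apertus Original: goldenswag-da, hellaswag-da, winogrande-da
scores = [47.66, 39.19, 12.48]
ses = [1.23, 1.01, 2.02]

mean, se = aggregate(scores, ses)
print(f"{mean:.2f} ± {se:.2f}")  # 33.11 ± 0.86
```

The result matches the "Average (3 evals)" entry for Apertus Original. Recomputing from the rounded per-benchmark values can differ from a reported aggregate in the last digit.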
| Suite | Task | Metric | Benchmark | Apertus Original ± SE | Apertus Munin ± SE | Apertus Δ ± SE | Ministral Original ± SE | Ministral Munin ± SE | Ministral Δ ± SE | Qwen Original ± SE | Qwen Munin ± SE | Qwen Δ ± SE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Danish | Common-sense Reasoning | MCC | Average (3 evals) | 33.11 ± 0.86 | 29.40 ± 1.28 | -3.71 ± 1.54 | 52.37 ± 1.19 | 50.21 ± 1.29 | -2.16 ± 1.76 | 64.64 ± 0.72 | 62.22 ± 0.88 | -2.42 ± 1.14 |
| Danish | Common-sense Reasoning | MCC | goldenswag-da | 47.66 ± 1.23 | 38.65 ± 2.50 | -9.01 ± 2.79 | 68.89 ± 1.49 | 64.52 ± 1.51 | -4.37 ± 2.12 | 85.94 ± 0.79 | 78.94 ± 1.19 | -7.00 ± 1.43 |
| Danish | Common-sense Reasoning | MCC | hellaswag-da | 39.19 ± 1.01 | 36.89 ± 1.88 | -2.30 ± 2.13 | 55.79 ± 2.04 | 51.80 ± 2.79 | -3.99 ± 3.46 | 73.28 ± 0.99 | 67.50 ± 1.35 | -5.78 ± 1.67 |
| Danish | Common-sense Reasoning | MCC | winogrande-da | 12.48 ± 2.02 | 12.65 ± 2.25 | +0.17 ± 3.02 | 32.43 ± 2.54 | 34.32 ± 2.21 | +1.89 ± 3.37 | 34.71 ± 1.75 | 40.22 ± 1.93 | +5.51 ± 2.61 |
| Danish | Grammatical Error Detection | micro-F1 | Average (1 eval) | 18.00 ± 1.31 | 17.35 ± 1.06 | -0.65 ± 1.69 | 21.72 ± 2.04 | 17.66 ± 0.74 | -4.06 ± 2.17 | 20.37 ± 1.25 | 20.74 ± 1.14 | +0.37 ± 1.69 |
| Danish | Grammatical Error Detection | micro-F1 | gerlangmod-da | 18.00 ± 1.31 | 17.35 ± 1.06 | -0.65 ± 1.69 | 21.72 ± 2.04 | 17.66 ± 0.74 | -4.06 ± 2.17 | 20.37 ± 1.25 | 20.74 ± 1.14 | +0.37 ± 1.69 |
| Danish | Instruction-following | Accuracy | Average (1 eval) | 69.00 ± 1.06 | 51.43 ± 1.38 | -17.57 ± 1.74 | 66.67 ± 1.31 | 74.38 ± 0.94 | +7.71 ± 1.61 | 81.56 ± 0.85 | 77.88 ± 0.89 | -3.68 ± 1.23 |
| Danish | Instruction-following | Accuracy | ifeval-da | 69.00 ± 1.06 | 51.43 ± 1.38 | -17.57 ± 1.74 | 66.67 ± 1.31 | 74.38 ± 0.94 | +7.71 ± 1.61 | 81.56 ± 0.85 | 77.88 ± 0.89 | -3.68 ± 1.23 |
| Danish | Knowledge | MCC | Average (5 evals) | 58.89 ± 0.73 | 62.29 ± 0.72 | +3.41 ± 1.03 | 73.58 ± 0.52 | 68.48 ± 0.58 | -5.09 ± 0.78 | 75.98 ± 0.52 | 77.64 ± 0.50 | +1.67 ± 0.72 |
| Danish | Knowledge | MCC | arc-da | 63.02 ± 1.58 | 65.87 ± 1.43 | +2.85 ± 2.13 | 84.17 ± 0.80 | 81.86 ± 0.77 | -2.31 ± 1.11 | 88.30 ± 0.80 | 88.82 ± 0.72 | +0.52 ± 1.08 |
| Danish | Knowledge | MCC | dameta | 62.52 ± 1.58 | 67.63 ± 1.66 | +5.11 ± 2.29 | 74.93 ± 1.21 | 66.52 ± 1.48 | -8.41 ± 1.91 | 76.70 ± 0.68 | 78.14 ± 1.01 | +1.44 ± 1.22 |
| Danish | Knowledge | MCC | danish-citizen-tests | 72.31 ± 2.18 | 73.25 ± 2.06 | +0.94 ± 3.00 | 78.01 ± 0.94 | 71.59 ± 1.71 | -6.42 ± 1.95 | 76.92 ± 1.67 | 82.16 ± 1.53 | +5.24 ± 2.26 |
| Danish | Knowledge | MCC | danske-talemaader | 55.11 ± 1.80 | 62.44 ± 1.68 | +7.33 ± 2.46 | 73.14 ± 1.56 | 68.47 ± 1.40 | -4.67 ± 2.10 | 75.67 ± 1.43 | 76.72 ± 0.92 | +1.05 ± 1.70 |
| Danish | Knowledge | MCC | mmlu-da | 41.47 ± 0.58 | 42.27 ± 1.11 | +0.80 ± 1.25 | 57.63 ± 1.11 | 53.97 ± 0.94 | -3.66 ± 1.45 | 62.30 ± 0.90 | 62.38 ± 1.20 | +0.08 ± 1.50 |
| Danish | Linguistic Acceptability | MCC | Average (2 evals) | 32.95 ± 1.19 | 29.34 ± 2.38 | -3.62 ± 2.66 | 43.37 ± 1.85 | 18.89 ± 3.06 | -24.48 ± 3.57 | 49.25 ± 1.68 | 52.19 ± 1.30 | +2.95 ± 2.12 |
| Danish | Linguistic Acceptability | MCC | dala | 29.38 ± 1.82 | 25.56 ± 3.12 | -3.82 ± 3.61 | 39.49 ± 2.66 | 16.99 ± 4.12 | -22.50 ± 4.90 | 45.65 ± 2.09 | 49.07 ± 1.55 | +3.42 ± 2.60 |
| Danish | Linguistic Acceptability | MCC | scala-da | 36.53 ± 1.53 | 33.11 ± 3.60 | -3.42 ± 3.91 | 47.25 ± 2.58 | 20.79 ± 4.52 | -26.46 ± 5.20 | 52.84 ± 2.62 | 55.31 ± 2.08 | +2.47 ± 3.35 |
| Danish | Multiple-choice Reading Comprehension | MCC | Average (1 eval) | 67.09 ± 1.02 | 65.99 ± 2.04 | -1.10 ± 2.28 | 85.94 ± 1.39 | 84.42 ± 1.12 | -1.52 ± 1.79 | 87.19 ± 1.17 | 87.35 ± 1.27 | +0.16 ± 1.73 |
| Danish | Multiple-choice Reading Comprehension | MCC | belebele-da | 67.09 ± 1.02 | 65.99 ± 2.04 | -1.10 ± 2.28 | 85.94 ± 1.39 | 84.42 ± 1.12 | -1.52 ± 1.79 | 87.19 ± 1.17 | 87.35 ± 1.27 | +0.16 ± 1.73 |
| Danish | Named Entity Recognition | micro-F1 | Average (2 evals) | 49.33 ± 1.40 | 47.60 ± 1.33 | -1.72 ± 1.93 | 61.13 ± 1.01 | 51.41 ± 1.76 | -9.72 ± 2.03 | 69.12 ± 1.16 | 69.63 ± 1.23 | +0.51 ± 1.69 |
| Danish | Named Entity Recognition | micro-F1 | dane | 51.69 ± 1.30 | 48.66 ± 1.53 | -3.03 ± 2.01 | 66.16 ± 1.42 | 53.80 ± 1.51 | -12.36 ± 2.07 | 74.51 ± 0.75 | 76.33 ± 0.85 | +1.82 ± 1.13 |
| Danish | Named Entity Recognition | micro-F1 | dansk | 46.97 ± 2.48 | 46.55 ± 2.18 | -0.42 ± 3.30 | 56.11 ± 1.44 | 49.02 ± 3.19 | -7.09 ± 3.50 | 63.74 ± 2.20 | 62.94 ± 2.31 | -0.80 ± 3.19 |
| Danish | Natural Language Inference | MCC | Average (2 evals) | 48.80 ± 2.34 | 52.09 ± 2.62 | +3.29 ± 3.51 | 25.75 ± 1.58 | 58.18 ± 2.22 | +32.42 ± 2.73 | 53.80 ± 1.93 | 65.63 ± 2.01 | +11.83 ± 2.79 |
| Danish | Natural Language Inference | MCC | danish-entailment | 57.24 ± 3.67 | 64.03 ± 4.45 | +6.79 ± 5.77 | 51.51 ± 3.17 | 55.54 ± 3.61 | +4.03 ± 4.80 | 62.10 ± 2.55 | 67.72 ± 1.46 | +5.62 ± 2.94 |
| Danish | Natural Language Inference | MCC | danish-lexical-inference | 40.37 ± 2.90 | 40.15 ± 2.76 | -0.22 ± 4.00 | 0.00 ± 0.00 | 60.82 ± 2.59 | +60.82 ± 2.59 | 45.51 ± 2.91 | 63.55 ± 3.75 | +18.04 ± 4.75 |
| Danish | Reading Comprehension | F1 | Average (2 evals) | 70.75 ± 0.51 | 69.41 ± 0.56 | -1.35 ± 0.76 | 69.71 ± 0.72 | 71.23 ± 0.81 | +1.52 ± 1.08 | 70.84 ± 0.55 | 71.96 ± 0.68 | +1.12 ± 0.88 |
| Danish | Reading Comprehension | F1 | multi-wiki-qa-da | 77.94 ± 0.97 | 74.11 ± 1.10 | -3.83 ± 1.47 | 75.77 ± 1.34 | 79.52 ± 1.43 | +3.75 ± 1.96 | 79.28 ± 0.78 | 79.51 ± 1.20 | +0.23 ± 1.43 |
| Danish | Reading Comprehension | F1 | scandiqa-da | 63.57 ± 0.34 | 64.70 ± 0.25 | +1.13 ± 0.42 | 63.65 ± 0.52 | 62.95 ± 0.76 | -0.70 ± 0.92 | 62.39 ± 0.78 | 64.40 ± 0.66 | +2.01 ± 1.02 |
| Danish | Sentiment Classification | MCC | Average (2 evals) | 57.89 ± 0.97 | 54.30 ± 1.10 | -3.59 ± 1.46 | 60.37 ± 0.97 | 59.56 ± 1.41 | -0.81 ± 1.71 | 64.89 ± 0.93 | 64.69 ± 1.00 | -0.20 ± 1.36 |
| Danish | Sentiment Classification | MCC | angry-tweets | 52.17 ± 0.96 | 48.76 ± 0.94 | -3.41 ± 1.34 | 52.91 ± 0.72 | 52.71 ± 1.32 | -0.20 ± 1.50 | 55.73 ± 1.16 | 56.26 ± 1.07 | +0.53 ± 1.58 |
| Danish | Sentiment Classification | MCC | danish-sentiment-in-context | 63.61 ± 1.68 | 59.84 ± 1.99 | -3.77 ± 2.60 | 67.83 ± 1.81 | 66.40 ± 2.49 | -1.43 ± 3.08 | 74.05 ± 1.45 | 73.11 ± 1.68 | -0.94 ± 2.22 |
| Danish | Summarization | chrF++ | Average (1 eval) | 37.56 ± 0.20 | 36.88 ± 0.20 | -0.68 ± 0.28 | 35.09 ± 0.35 | 36.97 ± 0.21 | +1.88 ± 0.41 | 36.51 ± 0.28 | 36.66 ± 0.44 | +0.15 ± 0.52 |
| Danish | Summarization | chrF++ | nordjylland-news | 37.56 ± 0.20 | 36.88 ± 0.20 | -0.68 ± 0.28 | 35.09 ± 0.35 | 36.97 ± 0.21 | +1.88 ± 0.41 | 36.51 ± 0.28 | 36.66 ± 0.44 | +0.15 ± 0.52 |
| Danish | Word-in-Context | MCC | Average (1 eval) | 11.83 ± 2.20 | 8.71 ± 3.46 | -3.12 ± 4.10 | 29.87 ± 1.70 | 23.27 ± 3.18 | -6.60 ± 3.61 | 44.60 ± 2.06 | 40.11 ± 3.46 | -4.49 ± 4.03 |
| Danish | Word-in-Context | MCC | danwic | 11.83 ± 2.20 | 8.71 ± 3.46 | -3.12 ± 4.10 | 29.87 ± 1.70 | 23.27 ± 3.18 | -6.60 ± 3.61 | 44.60 ± 2.06 | 40.11 ± 3.46 | -4.49 ± 4.03 |
| English | Common-sense Reasoning | Accuracy | Average (1 eval) | 58.70 ± 0.50 | 23.20 ± 0.40 | -35.50 ± 0.60 | 73.10 ± 0.40 | 59.60 ± 0.50 | -13.50 ± 0.70 | 90.00 ± 0.30 | 85.70 ± 0.30 | -4.30 ± 0.50 |
| English | Common-sense Reasoning | Accuracy | HellaSwag | 58.7 ± 0.5 | 23.2 ± 0.4 | -35.5 ± 0.6 | 73.1 ± 0.4 | 59.6 ± 0.5 | -13.5 ± 0.7 | 90.0 ± 0.3 | 85.7 ± 0.3 | -4.3 ± 0.5 |
| English | Instruction-following | Accuracy | Average (1 eval) | 73.30 ± 1.90 | 54.70 ± 2.00 | -18.50 ± 2.70 | 70.40 ± 1.80 | 69.80 ± 1.90 | -0.60 ± 2.70 | 89.60 ± 1.50 | 78.60 ± 1.80 | -11.00 ± 2.30 |
| English | Instruction-following | Accuracy | IFEval | 73.3 ± 1.9 | 54.7 ± 2.0 | -18.5 ± 2.7 | 70.4 ± 1.8 | 69.8 ± 1.9 | -0.6 ± 2.7 | 89.6 ± 1.5 | 78.6 ± 1.8 | -11.0 ± 2.3 |
| English | Knowledge | Accuracy | Average (4 evals) | 50.27 ± 0.46 | 41.92 ± 0.45 | -8.35 ± 0.65 | 81.72 ± 0.26 | 72.97 ± 0.29 | -8.75 ± 0.42 | 79.15 ± 0.22 | 82.40 ± 0.24 | +3.23 ± 0.33 |
| English | Knowledge | Accuracy | ARC-C | 55.5 ± 1.5 | 55.0 ± 1.5 | -0.5 ± 2.1 | 90.8 ± 0.8 | 88.5 ± 0.9 | -2.3 ± 1.3 | 96.1 ± 0.6 | 93.8 ± 0.7 | -2.3 ± 0.9 |
| English | Knowledge | Accuracy | ARC-E | 72.9 ± 0.9 | 74.2 ± 0.9 | +1.3 ± 1.3 | 96.3 ± 0.4 | 94.6 ± 0.5 | -1.7 ± 0.6 | 98.5 ± 0.3 | 98.3 ± 0.3 | -0.2 ± 0.4 |
| English | Knowledge | Accuracy | MMLU | 39.8 ± 0.4 | 31.7 ± 0.4 | -8.1 ± 0.6 | 71.3 ± 0.4 | 67.8 ± 0.4 | -3.6 ± 0.6 | 41.4 ± 0.4 | 76.6 ± 0.4 | +35.2 ± 0.6 |
| English | Knowledge | Accuracy | MMLU-Pro | 32.9 ± 0.4 | 6.8 ± 0.2 | -26.1 ± 0.5 | 68.5 ± 0.4 | 41.0 ± 0.4 | -27.4 ± 0.6 | 80.6 ± 0.4 | 60.9 ± 0.4 | -19.8 ± 0.6 |
| English | Long-context | Accuracy | Average (1 eval) | 34.60 ± 2.10 | 35.80 ± 2.10 | +1.20 ± 3.00 | 51.40 ± 2.20 | 49.40 ± 2.20 | -2.00 ± 3.20 | 67.20 ± 2.10 | 54.60 ± 2.20 | -12.60 ± 3.10 |
| English | Long-context | Accuracy | RULER 32k | 34.6 ± 2.1 | 35.8 ± 2.1 | +1.2 ± 3.0 | 51.4 ± 2.2 | 49.4 ± 2.2 | -2.0 ± 3.2 | 67.2 ± 2.1 | 54.6 ± 2.2 | -12.6 ± 3.1 |
| English | Math | Accuracy | Average (1 eval) | 68.10 ± 1.30 | 56.70 ± 1.40 | -11.40 ± 1.90 | 92.20 ± 0.70 | 82.30 ± 1.10 | -9.90 ± 1.30 | 94.80 ± 0.60 | 92.20 ± 0.70 | -2.60 ± 1.00 |
| English | Math | Accuracy | GSM8K | 68.1 ± 1.3 | 56.7 ± 1.4 | -11.4 ± 1.9 | 92.2 ± 0.7 | 82.3 ± 1.1 | -9.9 ± 1.3 | 94.8 ± 0.6 | 92.2 ± 0.7 | -2.6 ± 1.0 |
| English | Truthfulness | Accuracy | Average (1 eval) | 16.80 ± 1.30 | 15.70 ± 1.30 | -1.10 ± 1.80 | 64.70 ± 1.70 | 63.30 ± 1.70 | -1.50 ± 2.40 | 78.10 ± 1.40 | 74.20 ± 1.50 | -3.90 ± 2.10 |
| English | Truthfulness | Accuracy | TruthfulQA | 16.8 ± 1.3 | 15.7 ± 1.3 | -1.1 ± 1.8 | 64.7 ± 1.7 | 63.3 ± 1.7 | -1.5 ± 2.4 | 78.1 ± 1.4 | 74.2 ± 1.5 | -3.9 ± 2.1 |
| Agentic | Code | pass@1 | Average (2 evals) | 46.75 ± 2.49 | 39.20 ± 2.38 | -7.55 ± 3.44 | 75.05 ± 2.13 | 49.20 ± 2.31 | -25.85 ± 3.13 | 82.95 ± 1.84 | 77.20 ± 2.06 | -5.75 ± 2.72 |
| Agentic | Code | pass@1 | HumanEval | 42.1 ± 3.9 | 29.9 ± 3.6 | -12.2 ± 5.3 | 76.2 ± 3.3 | 29.3 ± 3.6 | -47.0 ± 4.9 | 87.8 ± 2.6 | 80.5 ± 3.1 | -7.3 ± 4.0 |
| Agentic | Code | pass@1 | MBPP p@1 | 51.4 ± 3.1 | 48.5 ± 3.1 | -2.9 ± 4.4 | 73.9 ± 2.7 | 69.1 ± 2.9 | -4.7 ± 3.9 | 78.1 ± 2.6 | 73.9 ± 2.7 | -4.2 ± 3.7 |
| Agentic | Tool Calling | Accuracy | Average (1 eval) | 52.40 ± 0.80 | 43.10 ± 0.80 | -9.30 ± 1.10 | 75.00 ± 0.70 | 49.20 ± 0.80 | -25.80 ± 1.00 | 79.40 ± 0.60 | 75.80 ± 0.70 | -3.60 ± 0.90 |
| Agentic | Tool Calling | Accuracy | BFCL | 52.4 ± 0.8 | 43.1 ± 0.8 | -9.3 ± 1.1 | 75.0 ± 0.7 | 49.2 ± 0.8 | -25.8 ± 1.0 | 79.4 ± 0.6 | 75.8 ± 0.7 | -3.6 ± 0.9 |
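The notes above the table say that Danish deltas carry propagated standard errors but do not spell out the formula. The sketch below assumes the usual rule for a difference of two independent estimates, sqrt(SE_Original^2 + SE_Munin^2); that rule is an assumption, although it reproduces the Danish rows checked here. The function name `delta_with_se` is hypothetical.

```python
import math


def delta_with_se(original, se_original, munin, se_munin):
    """Return (Munin - Original, propagated SE).

    Assumes the two scores are independent estimates, so their
    variances add: SE_delta = sqrt(SE_original^2 + SE_munin^2).
    """
    delta = munin - original
    se = math.sqrt(se_original**2 + se_munin**2)
    return delta, se


# goldenswag-da, Apertus Original vs. Apertus Munin
delta, se = delta_with_se(47.66, 1.23, 38.65, 2.50)
print(f"{delta:+.2f} ± {se:.2f}")  # -9.01 ± 2.79
```

This matches the Apertus Δ reported for goldenswag-da. Only the Danish deltas are stated to be computed from the reported scores, so recomputing an English or Agentic delta from the rounded scores may differ from the table in the last digit (for example, Apertus IFEval gives 54.7 - 73.3 = -18.6 against a reported -18.5).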