Rules

Framework Conditions

I have set the maximum time for generating an answer to 30 minutes.
I am using the OpenAI API from TGWUI for the tests.
The maximum answer length is determined by what the LLM supports, what is recommended, or what is feasible with my hardware. The upper limit is 128k tokens.
To avoid losses due to translation, all questions are in English except those about multilingualism. I have evaluated the translation tasks using 🔗Unbabel COMET.
I ran the "small" models (those that fit in 8-bit quantization into a GPU with 24GB) on a Linux Mint 22.2 system with an Intel i9 12900-KF with 192 GB RAM and one Nvidia 3090.
I ran the larger models on a Linux Mint 22.2 system with an AMD Threadripper 7860X with 128 GB RAM and 2 Nvidia 4090s.
I mostly use GGUF as the model format, but for comparison purposes, I have also done some tests with models in the EXL3 format.

Quantization

I generally use quantized models, usually in 8-bit quantization, if possible with my hardware. Particularly large models are in lower quantizations.
The K/V cache is also in 8-bit quantization.

Reasoning

For reasoning models, I test all possible reasoning settings, if possible.

Sampling Parameters

I have tried to find a recommendation for the sampling parameters for each model. The recommendations mostly come from the respective model cards on Huggingface.

Evaluation of Questions:

As a teacher's child, I naturally evaluated the answers using German school grades from 1 (very good) to 6 (insufficient).
Every error I perceived as such led to a downgrade by one grade.
If the model recognizes that it cannot answer a question (I call this the "Socratic" response), it receives a 4. Incorrect answers receive a 6.

I used gpt-5 for scoring the answers to question 12 („Wie behandelt man Fußpilz?“). I used the following prompt:

Prüfe den folgenden Text nach den folgenden Gesichtspunkten:
* Orthografische Korrektheit
* Grammatikalische Korrektheit
* Der Text soll frei von Anglizismen (wie "antifungal", "Over the Counter" oder "OTC" sein)
* Die folgenden Aspekte sollten berücksichtigt werden:
  - In Apotheken frei verkäufliche Medikamente und ihre Anwendung, Nebenwirkungen und Gefahren
  - Rezeptpflichtige Medikamente und ihre Anwendung, Nebenwirkungen und Gefahren
  - Hausmittel und ihre Anwendung und Wirksamkeit, Nebenwirkungen und Gefahren
  - Verhaltensmaßregeln zur Hygiene und Vorbeugung
  - Wann ist ein Arztbesuch notwendig?
  - Besondere Vorsicht bei Diabetes, Immunsuppression

Liste die Fehler im Text auf

Bewerte den Text mit einer deutschen Schulnote von 1 (sehr gut) bis 6 (ungenügend) und antworte mit
der Note und Deinen Bemerkungen. Führe in den Bemerkungen alle von Dir gefundenen Fehler
detailliert auf!

Hier der Text:

I am fully aware that my evaluation may be biased by my subjective perception. I do not claim objectivity.

Source:

🔗https://unbabel.github.io/COMET/html/index.html