Rules
Framework Conditions
- I have set the maximum time for generating an answer to 30 minutes.
- I am using the OpenAI API from TGWUI for the tests.
- The maximum answer length is determined by what the LLM supports, what is recommended, or what is feasible with my hardware. The upper limit is 128k tokens.
- To avoid losses due to translation, all questions are in English except those about multilingualism. I have evaluated the translation tasks using 🔗Unbabel COMET.
- I ran the "small" models (those that fit in 8-bit quantization into a GPU with 24GB) on a Linux Mint 22.2 system with an Intel i9 12900-KF with 192 GB RAM and one Nvidia 3090.
- I ran the larger models on a Linux Mint 22.2 system with an AMD Threadripper 7860X with 128 GB RAM and 2 Nvidia 4090s.
- I mostly use GGUF as the model format, but for comparison purposes, I have also done some tests with models in the EXL3 format.
Quantization
- I generally use quantized models, usually in 8-bit quantization, if possible with my hardware. Particularly large models are in lower quantizations.
- The K/V cache is also in 8-bit quantization.
Reasoning
- For reasoning models, I test all possible reasoning settings, if possible.
Sampling Parameters
- I have tried to find a recommendation for the sampling parameters for each model. The recommendations mostly come from the respective model cards on Huggingface.
Evaluation of Questions:
- As a teacher's child, I naturally evaluated the answers using German school grades from 1 (very good) to 6 (insufficient).
- Every error I perceived as such led to a downgrade by one grade.
- If the model recognizes that it cannot answer a question (I call this the "Socratic" response), it receives a 4. Incorrect answers receive a 6.
-
I used gpt-5 for scoring the answers to question 12 („Wie behandelt man Fußpilz?“). I used the following prompt:
Prüfe den folgenden Text nach den folgenden Gesichtspunkten: * Orthografische Korrektheit * Grammatikalische Korrektheit * Der Text soll frei von Anglizismen (wie "antifungal", "Over the Counter" oder "OTC" sein) * Die folgenden Aspekte sollten berücksichtigt werden: - In Apotheken frei verkäufliche Medikamente und ihre Anwendung, Nebenwirkungen und Gefahren - Rezeptpflichtige Medikamente und ihre Anwendung, Nebenwirkungen und Gefahren - Hausmittel und ihre Anwendung und Wirksamkeit, Nebenwirkungen und Gefahren - Verhaltensmaßregeln zur Hygiene und Vorbeugung - Wann ist ein Arztbesuch notwendig? - Besondere Vorsicht bei Diabetes, Immunsuppression Liste die Fehler im Text auf Bewerte den Text mit einer deutschen Schulnote von 1 (sehr gut) bis 6 (ungenügend) und antworte mit der Note und Deinen Bemerkungen. Führe in den Bemerkungen alle von Dir gefundenen Fehler detailliert auf! Hier der Text:
I am fully aware that my evaluation may be biased by my subjective perception. I do not claim objectivity.