iAsk Pro’s HELM MMLU Benchmark Results
93.89%
Exceeds Human Experts
We are thrilled to present our results on MMLU, the gold-standard academic benchmark for measuring the accuracy of large language models. iAsk Pro achieved a remarkable 93.89% accuracy (download proof of our results), the highest score of any current LLM, exceeding that of the top GPT-4.0 model by over 10%.
We used Stanford’s open-source HELM MMLU evaluation framework to verify this result, the first ever to exceed the human-expert accuracy level of 89.8%. iAsk Pro also surpassed our previous best score of 91.22%, achieved by our publicly available free model, iAsk Expert, which itself exceeded human-expert performance.
Our results are transparent and downloadable, with detailed scores for all 15,908 individual questions across 57 subjects. We also publish the model’s written explanation of the reasoning behind each selected answer for all 15,908 questions, whether the answer was correct or incorrect.
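As a sketch of how per-question results like these can be independently checked, the following Python snippet assumes a hypothetical CSV export with subject, model_answer, and correct_answer columns (the actual file name and layout may differ) and recomputes overall and per-subject accuracy:

```python
import csv
from collections import defaultdict

# Hypothetical file name and column layout; adjust to the actual downloadable export.
RESULTS_FILE = "iask_pro_mmlu_results.csv"

totals = defaultdict(int)    # questions seen per subject
correct = defaultdict(int)   # questions answered correctly per subject

with open(RESULTS_FILE, newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        subject = row["subject"]
        totals[subject] += 1
        if row["model_answer"].strip() == row["correct_answer"].strip():
            correct[subject] += 1

overall_total = sum(totals.values())
overall_correct = sum(correct.values())
print(f"Overall accuracy: {overall_correct / overall_total:.2%} "
      f"({overall_correct}/{overall_total} questions)")

# Per-subject breakdown, mirroring the 57-subject bar graph below.
for subject in sorted(totals):
    acc = correct[subject] / totals[subject]
    print(f"{subject:40s} {acc:6.2%} ({correct[subject]}/{totals[subject]})")
```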
The following bar graph highlights our top-ranking performance and compares it to top-scoring LLMs according to the HELM Leaderboard:
Detailed performance across all 57 subjects demonstrates our model’s extensive knowledge and understanding. This bar graph presents our accuracy in each subject:
These results underscore our commitment to advancing LLM capabilities and ensuring our models meet the highest standards of language understanding.
Understanding MMLU
MMLU (Massive Multitask Language Understanding) is a comprehensive benchmark designed to evaluate the language understanding capabilities of large language models (LLMs). Addressing the limitations of previous benchmarks such as GLUE and SuperGLUE, MMLU sets a higher standard for assessing how well these models understand and apply language across diverse and complex scenarios.
What Makes MMLU Unique?
The MMLU benchmark was created to bridge the gap between natural language processing (NLP) and natural language understanding (NLU). It aims to measure the true understanding and problem-solving abilities of LLMs using the knowledge they have been trained on.
What Makes the MMLU Benchmark a Gold-Standard for LLM Accuracy?
MMLU includes questions from 57 subjects spanning the humanities, social sciences, hard sciences, and other challenging areas, at levels of depth ranging from elementary to advanced professional. Examples of these specialized questions are shown in the accompanying images. The questions test a broad range of linguistic and reasoning skills, requiring LLMs to demonstrate their understanding of complex ideas and relationships within and across disciplines.
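For readers who want to inspect the benchmark itself, here is a minimal sketch using the Hugging Face datasets library. It assumes the community-hosted "cais/mmlu" dataset with its question, choices, answer, and subject fields; the dataset id and field names are assumptions that may change over time.

```python
from datasets import load_dataset

# Load the MMLU test split (assumed dataset id "cais/mmlu", config "all").
mmlu = load_dataset("cais/mmlu", "all", split="test")

print(f"Questions in this split: {len(mmlu)}")
print(f"Subjects: {len(set(mmlu['subject']))}")  # expected to cover the 57 MMLU subjects

# Format one item as the standard four-option multiple-choice prompt.
example = mmlu[0]
letters = ["A", "B", "C", "D"]
prompt = example["question"] + "\n" + "\n".join(
    f"{letter}. {choice}" for letter, choice in zip(letters, example["choices"])
)
print(prompt)
print("Correct answer:", letters[example["answer"]])  # "answer" is an index into "choices"
```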