iAsk Pro’s MMLU-Pro Benchmark Result: 85.85%
Exceeding Human Experts and Reaching an Expert AGI Level

The MMLU-Pro dataset is the most comprehensive and demanding multi-task understanding dataset created to date. Comprising over 12,000 intricate questions spanning various fields, the dataset is designed to evaluate the capabilities of large language models more thoroughly than ever before. MMLU-Pro, the open-source evaluation framework from TIGER AI Lab, aims to push the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. Please see the official external MMLU-Pro Leaderboard on Hugging Face, which lists iAsk Pro as the #1-scoring AI overall and as the #1-ranked AI in every subject area.

We are excited to share our performance results on the MMLU-Pro benchmark, the newly updated gold-standard academic benchmark for assessing the accuracy of large language models.

The iAsk Pro model scored the highest with a remarkable 85.85% accuracy (download raw proof of our results), outperforming all current leading LLMs, including OpenAI’s best-performing model, GPT-4o, by over 13 percentage points. Please view and analyze this spreadsheet for a cross-comparison of the questions (15% of the total) that iAsk Pro answered correctly but GPT-4o answered incorrectly, and vice versa (GPT-4o answered only 1% of questions correctly where iAsk Pro answered incorrectly).
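
As a rough illustration of how such a cross-comparison can be analyzed, the sketch below tallies per-question disagreements with pandas. The file name and column names are hypothetical placeholders, not the actual schema of our published spreadsheet.

```python
import pandas as pd

# Hypothetical export of the cross-comparison spreadsheet; column names are
# illustrative placeholders, not the real schema of the published file.
df = pd.read_csv("iask_vs_gpt4o_mmlu_pro.csv")  # columns: question_id, iask_correct, gpt4o_correct (0/1)

iask = df["iask_correct"].astype(bool)
gpt4o = df["gpt4o_correct"].astype(bool)
total = len(df)

print(f"iAsk Pro correct, GPT-4o incorrect: {(iask & ~gpt4o).sum() / total:.1%}")
print(f"GPT-4o correct, iAsk Pro incorrect: {(~iask & gpt4o).sum() / total:.1%}")
```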

MMLU Pro Rankings

The above bar graph illustrates our top-ranking performance and compares it to other leading LLMs.

We utilized Chain-of-Thought (CoT) reasoning to achieve the first-ever score exceeding human-expert accuracy on the MMLU-Pro benchmark, estimated at 78%. We also slightly surpassed what is estimated to be an expert AGI level, meaning the score of the top 10% of human experts, or 85%. (An explanation of AGI is located at the bottom of this article.)
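
For clarity, the two thresholds cited above can be expressed as a small check; the cutoff values are simply the estimates quoted in this article.

```python
# Thresholds cited in this article: ~78% for human-expert accuracy and
# ~85% for the top 10% of human experts (the "expert AGI" bar).
HUMAN_EXPERT = 78.0
EXPERT_AGI = 85.0

def classify(score_pct: float) -> str:
    if score_pct >= EXPERT_AGI:
        return "above the expert AGI bar (top 10% of human experts)"
    if score_pct >= HUMAN_EXPERT:
        return "above human-expert accuracy"
    return "below human-expert accuracy"

print(classify(85.85))  # above the expert AGI bar (top 10% of human experts)
```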

Our results are transparent and downloadable, offering detailed scores for all 12,032 individual questions across 14 subjects. Furthermore, our model’s written explanations of the reasoning behind each of its 12,032 answer choices are accessible, regardless of whether the answers were correct or incorrect.
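
For readers working with the downloadable per-question results, the sketch below shows how overall and per-subject accuracy could be recomputed; the file name and column names are illustrative assumptions, not the exact layout of our published files.

```python
import pandas as pd

# Hypothetical per-question results file; "subject" and "correct" (0/1) are
# assumed column names used only for this illustration.
results = pd.read_csv("iask_pro_mmlu_pro_results.csv")

per_subject = results.groupby("subject")["correct"].mean().sort_values(ascending=False)
overall = results["correct"].mean()

print(per_subject.round(4))
print(f"Overall accuracy over {len(results)} questions: {overall:.2%}")
```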

Detailed Performance Across 14 Subjects

Our model’s extensive knowledge and understanding are demonstrated through detailed performance metrics across 14 subjects. This bar graph illustrates our accuracy in those subjects:

iAsk MMLU Pro Results

Understanding MMLU-Pro

MMLU-Pro (Massive Multitask Language Understanding - Pro Benchmark Test) is an advanced evaluation framework designed to assess the reasoning and language understanding capabilities of large language models (LLMs). Building upon the original Massive Multitask Language Understanding (MMLU) benchmark, MMLU-Pro addresses key limitations by introducing more complex, reasoning-intensive questions and expanding the number of answer options from four to ten. This increase in distractors significantly raises the difficulty level, reducing the likelihood of correct guesses based on chance and ensuring a more robust evaluation of model performance across various domains.
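
The effect of the expanded option set on the random-guess baseline is easy to quantify: with ten equally likely options instead of four, expected accuracy from pure guessing drops from 25% to 10%.

```python
# Random-guess baseline: with k equally likely options, expected accuracy is 1/k.
for k in (4, 10):
    print(f"{k} options -> chance accuracy = {1 / k:.1%}")
# 4 options -> chance accuracy = 25.0%
# 10 options -> chance accuracy = 10.0%
```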

Differences Between MMLU-Pro and Original MMLU

The primary differences between MMLU-Pro and the original MMLU benchmark lie in the complexity and nature of the questions, as well as the structure of the answer choices. While MMLU primarily focused on knowledge-driven questions with a four-option multiple-choice format, MMLU-Pro integrates more challenging reasoning-focused questions and expands the answer choices to ten options. This change significantly increases the difficulty level, as evidenced by a 16% to 33% drop in accuracy for models tested on MMLU-Pro compared to those tested on MMLU. Additionally, MMLU-Pro eliminates trivial and noisy questions that were present in MMLU, ensuring that only high-quality, discriminative questions are included.

In constructing MMLU-Pro, researchers integrated over 12,000 questions spanning 14 diverse disciplines, including mathematics, physics, chemistry, law, and psychology.

Distribution of Disciplines in MMLU-Pro

As mentioned above, the dataset underwent rigorous filtering to eliminate trivial or erroneous questions and was subjected to two rounds of expert review to ensure accuracy and appropriateness. This meticulous process resulted in a benchmark that not only challenges LLMs more effectively but also provides greater stability in performance assessments across different prompting styles. Experimental results indicate that leading models experience a substantial drop in accuracy when evaluated with MMLU-Pro compared to the original MMLU, highlighting its effectiveness as a discriminative tool for tracking advancements in AI capabilities.

Performance gap between MMLU and MMLU-Pro

Dataset Restructuring for MMLU-Pro

Data Source Distribution in MMLU-Pro

The original MMLU dataset’s 57 subject categories were merged into 14 broader categories to focus on key knowledge areas and reduce redundancy. The following steps were taken to ensure data purity and a thorough final dataset:

  • Initial Filtering: Questions answered correctly by more than four of the eight evaluated models were considered too easy and excluded, resulting in the removal of 5,886 questions (a minimal sketch of this filtering rule appears after this list).
  • Question Sources: Additional questions were incorporated from the STEM Website, TheoremQA, and SciBench to expand the dataset.
  • Answer Extraction: GPT-4-Turbo was used to extract short answers from solutions provided by the STEM Website and TheoremQA, with manual verification to ensure accuracy.
  • Option Augmentation: Each question’s options were increased from four to ten using GPT-4-Turbo, introducing plausible distractors to enhance difficulty.
  • Expert Review Process: Conducted in two phases—verification of correctness and appropriateness, and ensuring distractor validity—to maintain dataset quality.
  • Incorrect Answers: Errors were identified from both pre-existing issues in the MMLU dataset and flawed answer extraction from the STEM Website.
  • False Negative Options: Distractors misclassified as incorrect were identified and reviewed by human experts to ensure they were indeed incorrect.
  • Bad Questions: Questions requiring non-textual information or unsuitable for multiple-choice format were removed.
  • Model Evaluation: Eight models including Llama-2-7B, Llama-2-13B, Mistral-7B, Gemma-7B, Yi-6B, and their chat variants were used for initial filtering.
  • Distribution of Issues: Table 1 categorizes identified issues into incorrect answers, false negative options, and bad questions across different sources.
  • Manual Verification: Human experts manually compared solutions with extracted answers to remove incomplete or incorrect ones.
  • Difficulty Enhancement: The augmentation process aimed to lower the likelihood of guessing correct answers, thus increasing benchmark robustness.
  • Average Options Count: On average, each question in the final dataset has 9.47 options, with 83% having ten options and 17% having fewer.
  • Quality Assurance: The expert review ensured that all distractors are distinctly different from correct answers and that each question is suitable for a multiple-choice format.
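
As referenced in the Initial Filtering item above, the "too easy" criterion can be expressed as a minimal sketch; this is an illustration of the rule as described, not the authors' actual code, and it assumes per-question correctness flags for the eight filtering models.

```python
# Hypothetical per-question correctness flags from the 8 filtering models.
def too_easy(model_correct_flags, threshold=4):
    """A question is dropped if more than `threshold` of the evaluated models answer it correctly."""
    return sum(model_correct_flags) > threshold

questions = [
    {"id": "q1", "flags": [1, 1, 1, 1, 1, 0, 0, 1]},  # 6 of 8 correct -> removed
    {"id": "q2", "flags": [1, 0, 0, 1, 0, 0, 0, 1]},  # 3 of 8 correct -> kept
]

kept = [q for q in questions if not too_easy(q["flags"])]
print([q["id"] for q in kept])  # ['q2']
```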

Impact on Model Performance (MMLU-Pro vs Original MMLU)

The introduction of more complex reasoning questions in MMLU-Pro has a notable impact on model performance. Experimental results show that models experience a significant drop in accuracy when transitioning from MMLU to MMLU-Pro. This drop highlights the increased challenge posed by the new benchmark and underscores its effectiveness in distinguishing between different levels of model capabilities. Furthermore, models utilizing Chain of Thought (CoT) reasoning techniques perform better on MMLU-Pro than those relying on direct answering methods, indicating that CoT reasoning is better suited for handling complex questions.
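
The difference between the two answering modes can be illustrated with simplified prompt templates. These are sketches of the general idea only, not the exact prompts used by the official MMLU-Pro evaluation, and the four-option example question is purely illustrative.

```python
# Illustrative question; MMLU-Pro questions carry up to ten options (A-J).
QUESTION = "Which gas makes up the largest share of Earth's atmosphere?"
OPTIONS = ["(A) Oxygen", "(B) Nitrogen", "(C) Carbon dioxide", "(D) Argon"]
options_block = "\n".join(OPTIONS)

# Direct answering: the model is asked for the letter only.
direct_prompt = (
    f"Question: {QUESTION}\n{options_block}\n"
    "Answer with the letter of the correct option only.\nAnswer:"
)

# Chain-of-Thought answering: the model reasons step by step before choosing.
cot_prompt = (
    f"Question: {QUESTION}\n{options_block}\n"
    "Let's think step by step, then finish with \"The answer is (X)\".\nAnswer:"
)
```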

Performance Comparison: MMLU vs. MMLU-Pro

Why Is MMLU-Pro a New Gold Standard for Measuring LLM Accuracy?

The findings from evaluations using MMLU-Pro reveal significant insights into model performance disparities. For instance, a top-performing model like GPT-4o achieved an accuracy of only 72.6% on MMLU-Pro, leaving considerable room for improvement, whereas top scores on the original MMLU had clustered around 86-87%. Furthermore, error analyses showed that many mispredictions stemmed from flaws in reasoning processes or a lack of specific domain expertise.

Elimination of Trivial Questions

MMLU-Pro’s elimination of trivial and noisy questions is another significant enhancement over the original benchmark. By removing these less challenging items, MMLU-Pro ensures that all included questions contribute meaningfully to assessing a model’s language understanding and reasoning abilities. This refinement helps create a more discriminative benchmark that better tracks progress in AI development.

Distribution of Issues Identified during the Expert Review Process

Chain of Thought Reasoning

The findings related to Chain of Thought (CoT) reasoning are particularly noteworthy. Unlike direct answering methods which may struggle with complex queries, CoT reasoning involves breaking down problems into smaller steps or chains of thought before arriving at an answer.

Models Performance on MMLU-Pro, CoT. Values are accuracies in percentages. (All the models use 5 shots except Gemini-1.5-pro and Gemini-1.5-Flash, which use 0 shots.)

Models employing CoT techniques demonstrated superior performance on MMLU-Pro compared to their performance on simpler benchmarks like MMLU. This suggests that CoT reasoning is crucial for tackling intricate language tasks effectively.
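
Because a CoT response ends in free-form reasoning rather than a bare letter, scoring it typically involves extracting the final choice from the generated text. The sketch below is an illustrative extraction step, not the official MMLU-Pro parsing code.

```python
import re

# Illustrative extraction of the final choice from a CoT-style response;
# assumes the model ends with a phrase like "the answer is (B)".
def extract_choice(response: str):
    match = re.search(r"answer is \(?([A-J])\)?", response, flags=re.IGNORECASE)
    return match.group(1).upper() if match else None

sample = "Nitrogen makes up roughly 78% of the atmosphere, so the answer is (B)."
print(extract_choice(sample))  # B
```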

Benchmark Sensitivity Reduction in MMLU-Pro

Reducing benchmark sensitivity is essential for achieving reliable evaluations across various conditions. The decreased sensitivity observed with MMLU-Pro means that models are less affected by changes in prompt styles or other variables during testing. This improvement enhances the robustness of evaluations conducted using this benchmark and ensures that results are reflective of true model capabilities rather than artifacts introduced by specific test conditions.
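
One simple way to quantify this kind of sensitivity is to evaluate the same model under several prompt templates and summarize the spread of its accuracies. The numbers below are invented solely to demonstrate the computation; they are not measured results.

```python
from statistics import mean, pstdev

# Invented accuracies of one model under three prompt templates (illustration only).
accuracies_by_prompt = {"template_a": 0.71, "template_b": 0.73, "template_c": 0.72}

scores = list(accuracies_by_prompt.values())
spread = max(scores) - min(scores)
print(f"mean={mean(scores):.3f}, std={pstdev(scores):.3f}, max-min spread={spread:.3f}")
# A smaller spread indicates lower sensitivity to prompt variations.
```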

MMLU-Pro Summary

MMLU-Pro represents a significant advancement over previous benchmarks like MMLU, offering a more rigorous assessment framework for large-scale language models. By incorporating complex reasoning-focused questions, expanding answer choices, eliminating trivial items, and demonstrating greater stability under varying prompts, MMLU-Pro provides a comprehensive tool for evaluating AI progress. The success of Chain of Thought reasoning techniques further underscores the importance of sophisticated problem-solving approaches in achieving high performance on this challenging benchmark.

MMLU-Pro represents a crucial step forward in benchmarking LLMs by emphasizing complex reasoning tasks and providing a clearer picture of their true capabilities.

Since our iAsk Pro model reached an expert AGI level on the MMLU-Pro benchmark, we describe Artificial General Intelligence (AGI) and how it is measured in the following section.


What is AGI?

Artificial General Intelligence (AGI) is a type of artificial intelligence that matches or surpasses human capabilities across a wide range of cognitive tasks. Unlike narrow AI, which excels in specific tasks such as language translation or game playing, AGI possesses the flexibility and adaptability to handle any intellectual task that a human can. This includes not only mastering specific domains but also transferring knowledge across various fields, displaying creativity, and solving novel problems. The ultimate goal of AGI is to create systems that can perform any task that a human being is capable of, thereby achieving a level of generality and autonomy akin to human intelligence.

How Is AGI Measured?

Google’s DeepMind has proposed a framework for classifying AGI into different levels to provide a common standard for evaluating AI models. This framework draws inspiration from the six-level system used in autonomous driving, which clarifies progress in that field. The levels defined by DeepMind range from “emerging” to “superhuman.” An emerging AGI is comparable to or slightly better than an unskilled human, while superhuman AGI outperforms any human in all relevant tasks. This classification system aims to quantify attributes like performance, generality, and autonomy of AI systems without necessarily requiring them to mimic human thought processes or consciousness.

AGI Performance Benchmarks

Levels of AGI

DeepMind emphasizes that the definition of AGI should focus on capabilities rather than the methods used to achieve them. For instance, an AI model does not need to demonstrate its abilities in real-world scenarios; it is sufficient if it shows the potential to surpass human abilities in given tasks under controlled conditions. This approach allows researchers to measure AGI based on specific performance benchmarks rather than subjective criteria. For example, an AI system might be considered competent if it outperforms 50% of skilled adults in various non-physical tasks, and superhuman if it outperforms 100% of skilled adults.
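
As an illustrative sketch only, such percentile thresholds could be encoded as below. The "competent" (50%) and "superhuman" (100%) cutoffs come from the description above; the "expert" cutoff (top 10%, i.e. outperforming 90% of skilled adults) is taken from the expert-AGI definition cited earlier in this article. The mapping is simplified and is not DeepMind's official specification.

```python
# Simplified classification by the share of skilled adults a system outperforms
# on non-physical tasks. Cutoffs: 50% (competent) and 100% (superhuman) from the
# text above; 90% (expert, i.e. top 10%) from the definition cited earlier.
def agi_level(pct_of_skilled_adults_outperformed: float) -> str:
    if pct_of_skilled_adults_outperformed >= 100:
        return "superhuman"
    if pct_of_skilled_adults_outperformed >= 90:
        return "expert"
    if pct_of_skilled_adults_outperformed >= 50:
        return "competent"
    return "emerging"

print(agi_level(55))   # competent
print(agi_level(92))   # expert
print(agi_level(100))  # superhuman
```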