iAsk Pro’s GPQA Benchmark Results: Outperforming the Experts

GPQA Diamond Comparison: iAsk Pro vs. Other AI Models
*These results reflect pass@1 accuracy, meaning the model's performance is measured based on getting the correct answer on its first attempt.

iAsk Pro’s Top Score on GPQA

iAsk Ai’s latest model, iAsk Pro, has achieved the highest score on the GPQA benchmark, a rigorous test designed to challenge even the most advanced AI models. By excelling in biology, physics, and chemistry, iAsk Pro has set a new standard for AI-driven problem-solving. Its ability to outperform leading models in accuracy and efficiency demonstrates its unmatched capability in handling complex scientific questions, solidifying its position as a leader in AI development.


What is GPQA?

GPQA (Graduate-Level Google-Proof Q&A Benchmark) is one of the most challenging tests ever designed for AI models. It comprises expert-written multiple-choice questions in biology, physics, and chemistry, curated to challenge AI capabilities beyond traditional benchmarks.

The dataset comes in three subsets of different sizes: extended (546 questions), main (448), and diamond (198). The original research paper evaluated zero-shot, few-shot, chain-of-thought (CoT), and search-augmented baselines.
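
For readers who want to inspect the benchmark themselves, here is a minimal Python sketch for loading and counting the three subsets. It assumes the Hugging Face `datasets` library and the publicly released GPQA dataset (`Idavidrein/gpqa`); the config names below follow the public release and are not part of iAsk’s evaluation setup.

```python
# Minimal sketch: inspect the three GPQA subsets via the public Hugging Face
# release (access is gated, so you may need to accept the license and log in).
from datasets import load_dataset

for subset in ("gpqa_extended", "gpqa_main", "gpqa_diamond"):
    ds = load_dataset("Idavidrein/gpqa", subset)   # assumed config names
    split = next(iter(ds.values()))                # released as a single split
    print(f"{subset}: {len(split)} questions")
# Expected counts per the paper: extended 546, main 448, diamond 198.
```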

iAsk Pro’s standout performance comes on the GPQA Diamond subset, which contains the 198 most difficult questions. These are so challenging that even PhD experts achieve only around 65% accuracy (arxiv.org), while top-performing AI models struggle to exceed 50%.


iAsk Pro’s Performance on GPQA Diamond

iAsk Pro achieved the highest score among all AI models on the GPQA Diamond Benchmark with 78.28% accuracy (download raw proof of our results), outperforming popular models such as OpenAI’s o1 (OpenAI Learning to Reason with LLMs) and Anthropic’s Claude 3.5 Sonnet (AnthropicAI Introducing Claude 3.5 Sonnet), the latter by nearly 19 percentage points.

This score was recorded using pass@1, a metric that checks whether the model’s very first response to each question is correct: “pass” refers to meeting a predefined correctness standard, and “1” means the model gets a single attempt.
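
As a rough illustration of the metric (not iAsk’s actual grading harness), pass@1 over a set of multiple-choice questions can be computed like this:

```python
# Illustrative pass@1 scoring: one sampled answer per question, counted
# correct only if that single first attempt matches the answer key.
def pass_at_1(first_answers, answer_key):
    """first_answers, answer_key: lists of option letters such as 'A'-'D'."""
    correct = sum(a == k for a, k in zip(first_answers, answer_key))
    return correct / len(answer_key)

# Example: pass_at_1(["B", "D", "A"], ["B", "C", "A"]) -> 2/3 (about 0.667)
```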

Beyond the first attempt (pass@1), iAsk Pro also recorded 83.83% accuracy under cons@5, placing it at the top of the leaderboard. Cons@5 is a majority-voting metric measured over 5 attempts: the model answers each question five times, and the answer chosen most often (the majority vote) is the one graded. This leverages agreement among the model’s responses, offering a more reliable indication of accuracy when multiple attempts converge on the same answer.
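
A similar sketch shows how a cons@k score can be computed by majority vote over k sampled answers; again, this illustrates the idea rather than the actual evaluation code:

```python
# Illustrative cons@k scoring: sample k answers per question and grade the
# most common one (ties are broken arbitrarily by Counter here).
from collections import Counter

def cons_at_k(sampled_answers, answer_key):
    """sampled_answers: one list of k option letters per question."""
    correct = 0
    for samples, key in zip(sampled_answers, answer_key):
        majority, _ = Counter(samples).most_common(1)[0]
        correct += int(majority == key)
    return correct / len(answer_key)

# Example: cons_at_k([["B", "B", "C", "B", "A"]], ["B"]) -> 1.0
```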

For comparison:

  • Claude 3.5 Sonnet achieved 67.2% accuracy with 32 attempts (cons@32).
  • OpenAI’s o1 reached 78.3% accuracy with 64 attempts (cons@64).

iAsk Pro’s ability to deliver more accurate answers with fewer tries proves its precision and efficiency, making it the best-performing model in this benchmark.

iAsk Pro used Chain of Thought (CoT) Reasoning, a technique that allows the model to break down complex problems into manageable steps, mimicking the thought process of human experts. This method has proven highly effective for reasoning-heavy benchmarks like GPQA Diamond.
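
To make the idea concrete, here is a hypothetical CoT-style prompt for a GPQA-style multiple-choice question. The wording and helper below are illustrative assumptions, not iAsk Pro’s internal prompting setup:

```python
# Hypothetical chain-of-thought prompt template for a four-option question.
COT_TEMPLATE = """You are answering a graduate-level science question.
Reason through the problem step by step, then state your final answer
as a single letter.

Question: {question}
(A) {a}
(B) {b}
(C) {c}
(D) {d}

Let's think step by step."""

def build_cot_prompt(question, options):
    """options: the four answer choices in display order."""
    a, b, c, d = options
    return COT_TEMPLATE.format(question=question, a=a, b=b, c=c, d=d)
```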

With 78.28% overall accuracy on GPQA Diamond, iAsk Pro demonstrated its capability to tackle some of the most difficult problems across multiple scientific domains, surpassing all competitors.

Here is how iAsk Pro performed across all subjects tested:

iAsk Pro GPQA Results
*These results reflect pass@1 accuracy, meaning the model's performance is measured based on getting the correct answer on its first attempt.

The chart below highlights iAsk Pro's exceptional accuracy across the three major fields in the GPQA Diamond benchmark. Notably, iAsk Pro outperformed OpenAI’s o1 model (OpenAI Learning to Reason with LLMs) in two of three categories, showcasing superior accuracy across these subjects.

Comparison by Subject: iAsk Pro vs. Others
*iAsk Pro performs best in Biology and Physics, while Human Experts outperform all models in Chemistry.
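
For reference, a per-subject breakdown like the one charted above reduces to a simple grouping over graded results; the record format here is an assumption for illustration only:

```python
# Illustrative per-subject accuracy: group graded answers by domain.
from collections import defaultdict

def accuracy_by_subject(records):
    """records: iterable of (subject, is_correct) pairs."""
    totals, hits = defaultdict(int), defaultdict(int)
    for subject, is_correct in records:
        totals[subject] += 1
        hits[subject] += int(is_correct)
    return {s: hits[s] / totals[s] for s in totals}

# Example: accuracy_by_subject([("Physics", True), ("Physics", False),
#                               ("Biology", True)])
# -> {"Physics": 0.5, "Biology": 1.0}
```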

The Origins and Development of GPQA

The GPQA Benchmark was developed through a rigorous, multi-step process to challenge both human experts and advanced AI systems. The goal was to create a dataset that probes the limits of expert knowledge, focusing specifically on graduate-level biology, physics, and chemistry. Here’s how the questions are constructed:

Pipeline

1. Expert-Driven Question Creation

The questions in GPQA were created by domain experts with advanced degrees in their respective fields. These questions are designed to require multi-step reasoning and a deep understanding of complex scientific concepts, making them significantly more challenging than typical benchmark questions.

2. Expert Validation

After initial question creation, the dataset underwent an extensive expert validation process to ensure the quality and difficulty of the questions. This step involved verifying that the questions reflect true graduate-level challenges.

3. Question Revision

Based on the expert feedback, revisions were made to further enhance question clarity and difficulty. The revision process also focused on crafting answer options that make the questions resistant to superficial searches or lookup-based solutions.

4. Non-Expert Validation

To ensure the questions were "Google-proof," they were tested on non-experts who had unrestricted web access. These individuals achieved only 34% accuracy (arxiv.org), demonstrating the depth of knowledge required to answer GPQA questions correctly.

5. Benchmark for Advanced AI and Human Experts

The GPQA benchmark was designed to test the limits of AI models in generating reliable information in complex scientific domains. Even PhD-level experts struggle with these questions, achieving around 65% accuracy (arxiv.org) on average. For context, top AI models like GPT-4o perform at 50.6% accuracy (OpenAI Learning to Reason with LLMs) or less, showing the extreme challenge posed by this benchmark.


How Does GPQA Compare to Other Benchmarks like MMLU and MMLU-Pro?

While GPQA, MMLU (Massive Multitask Language Understanding), and the more difficult MMLU-Pro benchmarks all test AI’s ability to understand a wide range of topics, GPQA stands out for its focus on deep scientific reasoning. Here’s how GPQA differs:

1. Focus on Graduate-Level Scientific Reasoning

  • GPQA is specifically designed to test advanced scientific knowledge and reasoning in biology, physics, and chemistry. The questions are not just factual recall but require deep understanding and multi-step reasoning.
  • MMLU and MMLU-Pro, on the other hand, cover a broader array of subjects, including history, economics, mathematics, and law, but their difficulty is often knowledge-driven rather than reasoning-intensive in the way GPQA’s is.

2. Google-Proof Design

  • GPQA questions are deliberately designed to be "Google-proof," meaning they cannot be easily solved using basic web searches or superficial understanding. Even PhD-level experts find the GPQA Diamond questions extremely difficult.
  • MMLU and MMLU-Pro include some questions that can be tackled using general knowledge or surface-level reasoning, especially in the original MMLU. MMLU-Pro has evolved to include more reasoning-intensive questions but still spans a broader, more general knowledge base.

3. Complexity and Reasoning Depth

  • In GPQA, the complexity is significantly higher due to the need for multi-step reasoning to solve the problems. This is particularly true in the GPQA Diamond subset, which focuses on the hardest problems.
  • MMLU-Pro introduced more complex reasoning compared to the original MMLU by expanding the number of distractor options and increasing the difficulty of questions, but it still doesn’t reach the level of specialization seen in GPQA's domain-specific problems.

4. Subject-Specific Expertise

  • GPQA evaluates models on their ability to solve questions at a graduate level in specific scientific fields. This sets a high bar for models to demonstrate true subject-matter expertise.
  • MMLU and MMLU-Pro cover a wider range of topics, testing both general and specialized knowledge across many different fields, but they lack the subject-specific depth of GPQA’s expert-level scientific focus.

5. Differences in Performance Metrics

  • In MMLU-Pro, many leading models like GPT-4o score upwards of 72%, with iAsk Pro surpassing 85% accuracy. These benchmarks test a wide array of subjects, making it possible for models to perform well across easier topics while still struggling in more difficult ones.
  • In contrast, GPQA Diamond is a much tougher benchmark, where even a top-performing model like Claude 3.5 Sonnet achieves only 59.4% accuracy (AnthropicAI Introducing Claude 3.5 Sonnet). This illustrates the extreme difficulty and specialized nature of GPQA.

Explore More on iAsk Pro and GPQA

Interested in learning more about iAsk Pro’s performance on the GPQA and other benchmarks?

  • Download the GPQA full dataset to explore iAsk Pro’s results in detail
  • View the full GPQA document for more details about the benchmark test (arxiv.org)
  • See iAsk Pro’s MMLU benchmark results (#1 of all AI models)
  • See iAsk Pro’s MMLU-Pro benchmark results (#1 of all AI models)