How We Measured Hallucination Rates
Most AI hallucination studies rely on synthetic benchmarks — curated question sets designed in academic environments. Our approach is different. We measured factual accuracy using Search Umbrella's Trust Score evaluation framework, which scores real user queries across real-world domains.
The Trust Score framework uses 7 metrics to assess each AI response. For this analysis, we focused on Factual Accuracy (FA) — the metric that directly measures whether a model's response contains verifiable, correct information versus fabricated or incorrect claims.
Here is how the evaluation works:
- Real user queries, not synthetic benchmarks. Every query in our dataset came from an actual user interaction — professional research questions, technical queries, legal lookups, business analysis requests, and general knowledge questions.
- Independent evaluation. Every response is scored blind by a separate evaluator model, which assesses factual accuracy on a 0-10 scale without knowing which model produced the response (a minimal sketch of this step appears after this overview).
- Evaluation period: December 2025 through February 2026.
- Scale: 2,637 evaluations across 32 models and 8 domains.
- Transparency: Full methodology is documented at howismyai.com/methodology.html.
This is not a lab benchmark. It is a real-world factual accuracy measurement across the kinds of queries professionals actually submit to AI systems every day.
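To make the blind-scoring step concrete, here is a minimal sketch in Python of the general pattern. It is an illustration only, not Search Umbrella's production pipeline; the model names, canned responses, and scoring callbacks are hypothetical placeholders.

```python
import random
from dataclasses import dataclass
from typing import Callable

@dataclass
class Evaluation:
    query: str
    model_id: str            # recorded for the rankings, but hidden from the evaluator
    response: str
    factual_accuracy: float  # 0-10 score assigned by the evaluator model

def evaluate_blind(
    query: str,
    model_ids: list[str],
    ask_model: Callable[[str, str], str],        # (model_id, query) -> response
    ask_evaluator: Callable[[str, str], float],  # (query, response) -> 0-10 score
) -> list[Evaluation]:
    """Score every model's response without telling the evaluator which model wrote it."""
    responses = [(m, ask_model(m, query)) for m in model_ids]
    random.shuffle(responses)  # remove ordering cues before evaluation

    results = []
    for model_id, response in responses:
        # The evaluator sees only the query and the anonymized response text.
        score = ask_evaluator(query, response)
        results.append(Evaluation(query, model_id, response, score))
    return results

# Toy demo with canned responses and a fake evaluator, purely for illustration.
canned = {"model-a": "Paris", "model-b": "Paris", "model-c": "Lyon"}
evals = evaluate_blind(
    "What is the capital of France?",
    list(canned),
    ask_model=lambda m, q: canned[m],
    ask_evaluator=lambda q, r: 9.0 if r == "Paris" else 2.0,
)
for e in evals:
    print(e.model_id, e.factual_accuracy)
```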
2026 AI Hallucination Rate Rankings
The following table ranks 20 AI models by Factual Accuracy score from our 2,637-evaluation dataset. A higher FA score means fewer hallucinations and more factually reliable responses.
| Rank | Model | Provider | Factual Accuracy | Trust Score | Evaluations |
|---|---|---|---|---|---|
| 1 | GPT-5 Mini | OpenAI | 8.92 | 8.80 | 26* |
| 2 | GPT-5 | OpenAI | 8.82 | 8.83 | 60 |
| 3 | Gemini 2.5 Pro | Google | 8.78 | 8.96 | 16* |
| 4 | GPT-5.2 | OpenAI | 8.54 | 8.71 | 62 |
| 5 | GPT-5.1 | OpenAI | 8.43 | 8.52 | 23* |
| 6 | GPT-5.2 (Thinking) | OpenAI | 8.39 | 8.71 | 185 |
| 7 | GPT-4.1 | OpenAI | 8.34 | 8.64 | 50 |
| 8 | Gemini 3 Flash | Google | 8.16 | 8.64 | 51 |
| 9 | GPT-5.1 (Thinking) | OpenAI | 8.05 | 8.37 | 79 |
| 10 | Gemini 2.5 Flash | Google | 7.78 | 8.67 | 41 |
| 11 | Claude Sonnet 4.5 | Anthropic | 7.78 | 8.40 | 651 |
| 12 | Grok 4 | xAI | 7.72 | 8.02 | 39 |
| 13 | GPT-4.1 Nano | OpenAI | 7.67 | 8.74 | 21* |
| 14 | Gemini 3 Pro | Google | 7.64 | 8.26 | 479 |
| 15 | Claude Opus 4.6 | Anthropic | 7.57 | 8.16 | 117 |
| 16 | Grok 4 (Reasoning) | xAI | 7.51 | 8.08 | 142 |
| 17 | GPT-4o | OpenAI | 7.26 | 8.29 | 27* |
| 18 | GPT-5 (Generic) | OpenAI | 7.21 | 7.72 | 41 |
| 19 | Sonar Pro | Perplexity | 7.01 | 7.70 | 67 |
| 20 | Grok 4.1 (Reasoning) | xAI | 6.96 | 7.68 | 99 |
* = fewer than 30 evaluations — interpret with caution. Data from Search Umbrella's Trust Score evaluation framework, December 2025 – February 2026. See the full interactive leaderboard at howismyai.com/leaderboard.html.
Several patterns emerge from this data that are worth examining closely. The spread between the top performer (GPT-5 Mini at 8.92) and the lowest-ranked model here (Grok 4.1 Reasoning at 6.96) is nearly two full points on a 10-point scale. On identical queries, some models produce reliably factual responses while others introduce errors at significantly higher rates.
Key observation: Factual Accuracy and overall Trust Score do not always move together. Gemini 2.5 Pro has the highest overall Trust Score (8.96) but ranks third on Factual Accuracy (8.78). GPT-4.1 Nano scores 8.74 on Trust Score but only 7.67 on Factual Accuracy. A model can be helpful, well-structured, and coherent while still getting facts wrong.
Hallucination Rates by Domain
Not all queries carry equal hallucination risk. Our data reveals dramatic differences depending on the subject domain of the query; the table below reports each domain's average overall Trust Score, of which Factual Accuracy is one component. Understanding where models struggle most is critical for professionals who depend on AI in high-stakes fields.
| Domain | Avg Trust Score | Evaluations | Hallucination Risk |
|---|---|---|---|
| Coding | 8.61 | 713 | Lowest |
| Business | 8.30 | 134 | Low-Moderate |
| Legal | 8.30 | 44 | Low-Moderate |
| Technical | 8.27 | 433 | Low-Moderate |
| Creative | 8.25 | 53 | Moderate |
| General | 7.72 | 1,116 | Elevated |
| Research | 7.39 | 107 | Highest |
Why Research Queries Are the Most Dangerous
Research queries — requests for specific studies, data points, statistics, and academic citations — show the highest hallucination risk in our data. This is precisely the domain where accuracy matters most and where fabricated information is hardest to detect. An AI model that confidently cites a nonexistent study with a plausible author name, journal, and publication year produces an error that looks indistinguishable from a real citation. For professionals relying on AI for research, this domain-specific risk demands verification protocols, not blind trust.
Coding queries rank highest for accuracy because code can be tested — the compiler or interpreter provides an objective verification signal. There is no equivalent "compiler" for a legal opinion, a market sizing estimate, or a historical claim. That asymmetry is exactly why factual accuracy varies so dramatically by domain.
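To illustrate why code has an objective verification signal, here is a minimal sketch: a model-generated function can be checked by simply executing it against known cases, while a factual claim cannot. The generated function, test values, and example claim are all invented for illustration.

```python
# Hypothetical model-generated code: easy to verify, because we can just run it.
generated_code = """
def median(values):
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2
"""

namespace = {}
exec(generated_code, namespace)  # load the generated function

# Objective verification: the interpreter tells us whether the code is right.
assert namespace["median"]([3, 1, 2]) == 2
assert namespace["median"]([4, 1, 2, 3]) == 2.5
print("generated code passed its tests")

# A model-generated factual claim has no such check; there is nothing to "run".
generated_claim = "The study was published in 2017 in the Journal of Example Research."
# Verifying this requires an external source, which is exactly where hallucinations hide.
```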
The Legal domain (8.30, 44 evaluations) performs better than many expect, but the sample size is relatively small. More importantly, the cost of a legal hallucination — a fabricated case citation in a court filing, for instance — is orders of magnitude higher than a coding error that fails at runtime. Domain accuracy scores must be interpreted alongside the stakes of being wrong.
Want to see how your AI performs on the queries that matter to your work?
Try Cross-Model Verification Free
Key Findings
1. The 5.8-Point Accuracy Gap Is Larger Than Most People Expect
When we ran identical queries through all 32 models, the factual accuracy gap between the best and worst performers was 5.8 points on a 10-point scale. Because this compares the strongest and weakest responses to the same questions, the gap is wider than the roughly two-point spread between the averaged scores in the rankings table above, where good and bad answers get smoothed together. That is not a marginal difference: on the same question, one model provides a reliably factual response while another introduces significant errors. If you are using a single AI model without verification, your accuracy depends entirely on which model you happened to choose.
2. OpenAI Dominates the Top but Shows High Variance
Seven of the top ten factual accuracy scores belong to OpenAI models. But OpenAI also shows the widest internal variance: GPT-5 Mini scores 8.92 while GPT-5 Generic scores 7.21. The model version you use within the same provider matters enormously. GPT-4.1 (8.34) outperforms GPT-5 Generic (7.21) and GPT-5.1 Thinking (8.05) — newer does not always mean more accurate.
3. Claude Sonnet 4.5 Provides the Most Statistically Robust Data Point
With 651 evaluations, the largest sample for any model in the dataset, Claude Sonnet 4.5's factual accuracy score of 7.78 is the most statistically reliable number in our rankings. Models with fewer than 30 evaluations (marked with *) may shift significantly as more data accumulates. Claude's large sample size means its 7.78 score represents a stable, reliable measurement of real-world factual accuracy.
4. Research and General Domains Carry the Highest Hallucination Risk
General knowledge queries (7.72 trust score, 1,116 evaluations) and Research queries (7.39, 107 evaluations) are the two domains where AI models are most likely to hallucinate. These are also — not coincidentally — the domains where users are most likely to trust AI responses without independent verification, because the queries feel straightforward. "What year did X happen?" or "What study showed Y?" feel like simple factual lookups. But these are precisely the queries where models most frequently fabricate confident-sounding wrong answers.
5. Cross-Model Verification Catches What Single-Model Use Misses
The fundamental insight from this data: no single model is reliably accurate across all domains and query types. The variance between models on identical queries means that any single-model workflow has blind spots. When you ask one model and get a confident answer, you have no signal about whether that particular response falls in the model's strength zone or its hallucination zone.
How Cross-Model Verification Reduces Hallucination Risk
Search Umbrella was built on a principle that this data reinforces: the most reliable way to reduce hallucination risk is to run the same query through multiple models and check for consensus.
Here is how it works in practice:
- One query, 8+ models. Your query goes to ChatGPT, Claude, Gemini, Grok, Perplexity, LLaMA, Mistral, and AI21 simultaneously. No extra typing, no tab-switching, no managing multiple accounts.
- Trust Score flags disagreement. When all eight models agree on a factual claim, the Trust Score is high — the cross-model consensus provides strong evidence that the claim is accurate. When models diverge, the Trust Score drops, flagging exactly where you should investigate further before acting.
- Ensemble Disagreement metric. The Trust Score includes an Ensemble Disagreement component that specifically measures cross-model consensus. High disagreement on a specific claim is a quantified hallucination risk signal (a simplified sketch of the idea follows this list).
- Domain-aware interpretation. In domains where accuracy varies the most — Research, General — cross-model verification is most valuable because that is where individual models are most likely to hallucinate. A Research query where seven of eight models agree on a statistic is dramatically more trustworthy than any single model's confident assertion.
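As a rough illustration of the consensus idea behind the Ensemble Disagreement bullet above, the toy sketch below scores how much a set of model answers to the same factual question diverge. It assumes short answers that can be compared as normalized strings; the real Trust Score component is more sophisticated than this.

```python
from collections import Counter

def disagreement_score(answers: list[str]) -> float:
    """Return 0.0 when every model gives the same answer, approaching 1.0 as answers diverge.

    Toy version: compares normalized answer strings directly. A production metric
    would compare extracted claims, not raw text.
    """
    normalized = [a.strip().lower() for a in answers]
    counts = Counter(normalized)
    most_common = counts.most_common(1)[0][1]
    return 1.0 - most_common / len(normalized)

# Hypothetical answers from eight models to the same factual query.
answers = [
    "1969", "1969", "1969", "1969", "1969", "1969",  # six models agree
    "1968", "1971",                                   # two models diverge
]

score = disagreement_score(answers)
print(f"disagreement: {score:.2f}")  # 0.25: six of eight agree; the two outliers mark what to verify
# A score near 0 signals consensus; a high score flags exactly where hallucination risk is concentrated.
```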
The point is not that any individual model is bad. The point is that no individual model is reliable enough to trust without verification — and the most efficient verification is checking what other models say about the same question.
This is the core principle behind Search Umbrella. Not "which AI is best" — but "what do all the AIs agree on, and where do they disagree?" The disagreement is the signal. It tells you exactly where hallucination risk is highest for your specific query.
For a deeper look at how this multi-model approach works and why it matters, see our full explanation: The Unified LLM Approach.
Frequently Asked Questions
What is the AI hallucination rate?
AI hallucination rates vary significantly by model, domain, and query type. In our testing of 32 models across 2,637 real-world evaluations, factual accuracy scores ranged from 6.0 to 8.92 on a 0-10 scale. This translates to hallucination rates (as inverse accuracy) that range from roughly 10% for the best models to 40% for the worst on certain query types. Coding queries show the lowest hallucination rates, while Research and General knowledge queries show the highest. There is no single "AI hallucination rate" — it depends entirely on which model you use and what you ask it.
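The "inverse accuracy" conversion above is simple arithmetic. A quick sketch, assuming the 0-10 factual accuracy score maps linearly onto an error rate:

```python
def approx_hallucination_rate(factual_accuracy: float) -> float:
    """Approximate hallucination rate as the share of the 0-10 scale lost to errors."""
    return (10.0 - factual_accuracy) / 10.0

for score in (8.92, 7.78, 6.0):
    print(f"FA {score:>4} -> ~{approx_hallucination_rate(score):.0%} hallucination rate")
# FA 8.92 -> ~11% hallucination rate
# FA 7.78 -> ~22% hallucination rate
# FA  6.0 -> ~40% hallucination rate
```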
Which AI hallucinates the least?
In our 2026 data, GPT-5 Mini scored the highest factual accuracy at 8.92/10 — but with only 26 evaluations, that score should be interpreted with caution. The most statistically reliable top performer is GPT-5 at 8.82/10 with 60 evaluations. Among models with large evaluation counts, GPT-5.2 Thinking (8.39 FA, 185 evaluations) and Claude Sonnet 4.5 (7.78 FA, 651 evaluations) provide the most robust data. See the full interactive leaderboard for the complete rankings across all 32 models.
Do newer models hallucinate less?
Generally yes — but with important exceptions. Our data shows GPT-5 Generic scored only 7.21 on factual accuracy, while the older GPT-4.1 scored 8.34. Grok 4.1 Reasoning (6.96) performed worse than Grok 4 (7.72). Thinking and reasoning model variants sometimes sacrifice raw factual accuracy for more deliberate step-by-step analysis. The lesson: do not assume that a newer model version is automatically more factually accurate. Test it, or better yet, compare it against multiple models on the queries that matter to your work.
How can I check if AI is hallucinating?
Three proven methods:
- Cross-model verification. Run the same query through multiple AI models and check for consensus. When models trained on different data by different organizations reach the same conclusion, the probability that all of them independently hallucinated the same fact drops dramatically. Search Umbrella automates this across 8+ models with a Trust Score.
- Source checking. Ask the AI for its sources and independently verify them. Fabricated citations — real-sounding journal names, plausible author names, realistic publication dates — are one of the most common and most dangerous hallucination patterns. A minimal lookup sketch follows this list.
- Reasoning chain analysis. Ask the model to show its step-by-step reasoning. Hallucinated facts often appear as unsupported leaps — a conclusion that does not follow from the preceding reasoning steps.
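For the source-checking method above, one lightweight concrete check is to look up a cited DOI in a public registry such as Crossref. This sketch requires the requests package; the DOI shown is a made-up placeholder, and a missing record is only a hint, since some legitimate works are registered elsewhere.

```python
import requests

def doi_exists_in_crossref(doi: str) -> bool:
    """Check whether a cited DOI is registered with Crossref.

    A 200 response means the DOI resolves to a registered work; a 404 means
    Crossref has no record of it, which is a strong hint the citation may be
    fabricated (though not proof, since other registries exist).
    """
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    return resp.status_code == 200

# Hypothetical DOI pulled from an AI response; replace with the citation you want to verify.
cited_doi = "10.1234/example.2021.001"
print("registered with Crossref:", doi_exists_in_crossref(cited_doi))
```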
Of these, cross-model verification is the most reliable for a simple reason: it does not depend on the hallucinating model to identify its own errors. An external check from independent models is inherently more trustworthy than asking the same model "are you sure?"
