Is GPT-5.5 Reliable For Citations? No. It’s The Worst Flagship For That Job.
The one GPT-5.5 benchmark that matters for your critical AI literacy.
TL;DR
GPT-5.5 launched April 23, 2026 and tops every builder benchmark — Terminal-Bench, OSWorld, GDPval, ARC-AGI-2, long-context MRCR. It also posts an 86% hallucination rate on Artificial Analysis’s AA-Omniscience benchmark, against 36% for Claude Opus 4.7 and 50% for Gemini 3.1 Pro Preview. For citation work — deep research, regulatory references, GEO source claims — GPT-5.5 is the worst flagship choice. Use Claude Opus 4.7 for facts, GPT-5.5 for code and reasoning, and a two-model verification pass when both matter.
GPT-5.5 launched April 23, 2026.
It tops every benchmark that matters for builders.
Terminal-Bench. OSWorld. GDPval. ARC-AGI-2. Long-context MRCR. A real jump.
It’s also the most confidently wrong flagship on the market.
Hey, I’m Karo Zieminski 🤗 AI Product Manager and builder. I write Product with Attitude, an AI newsletter community of 17,000+ subscribers learning to build with AI and developing critical AI literacy through practice. The kind where you sit down on a Saturday morning, follow a guide, and walk away with a working agent, automation, or product. Built by you. Understood by you. Owned by you.
What’s Inside
The 86% number AA-Omniscience surfaced and OpenAI didn’t lead with.
Why GPT-5.5 confabulates more than any flagship right now.
The two-model workflow I recommend for citation-heavy work.
What this means for your critical AI literacy.
Where Grok and Opus actually sit on the hallucination leaderboard.
86%: The Benchmark OpenAI Didn’t Lead With
I thought I’d skim the benchmarks.
I did not skim the benchmarks.
I’ve been staring at them all morning, trying to get them to agree with each other.
On AA-Omniscience, the benchmark designed to penalize confident wrong answers, GPT-5.5 (xhigh) hits an 86% hallucination rate.
Same benchmark, other flagships:
Claude Opus 4.7 (max): 36%
Gemini 3.1 Pro Preview: 50%
GPT-5.5 (xhigh): 86%
Here’s the quick decision matrix from the same data:
AA-Omniscience defines hallucination rate as the share of non-correct responses where the model confabulated instead of abstaining. Not 86% of all answers.
It doesn’t mean the model is wrong 86% of the time. It means: when GPT-5.5 doesn’t know, it almost never tells you. It guesses. In the same tone it uses when it’s right.
GPT-5.5 also posts the highest accuracy AA has ever recorded: 57%. That’s the trade. It knows more, builds better, answers more, and makes things up more.
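To make that definition concrete, here is a small worked calculation. It is a sketch under two assumptions of mine: that the 57% accuracy and the 86% hallucination rate describe the same run, and that every response is exactly one of correct, confabulated, or abstained. The function name `split_responses` is hypothetical, not something AA publishes.

```python
def split_responses(accuracy: float, hallucination_rate: float) -> dict:
    """Split all responses into correct / confabulated / abstained shares.

    Assumes hallucination_rate is the share of non-correct responses that
    were wrong answers rather than abstentions (the definition quoted above).
    """
    non_correct = 1.0 - accuracy
    confabulated = non_correct * hallucination_rate
    abstained = non_correct - confabulated
    return {
        "correct": accuracy,
        "confabulated": confabulated,
        "abstained": abstained,
    }


shares = split_responses(accuracy=0.57, hallucination_rate=0.86)
for label, share in shares.items():
    print(f"{label}: {share:.0%}")
```

Under those assumptions, roughly 37 of every 100 answers are confident misses and only about 6 are abstentions. That is the gap between "86% of non-correct answers" and "86% of all answers".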
Why This Matters for Citation Work
Confabulation is a specific failure where the model invents names, numbers, dates, regulations, URLs. It’s like someone telling a story and casually adding details that were never there.
OpenAI’s own system card shows the same pattern.
When GPT‑5.5 makes a single factual claim (one date, one name, one number), it’s right more often than GPT‑5.4. About 23% more often.
But when you look at the whole answer, the improvement is tiny: the chance that there’s any mistake in the response only improves by 3%.
That’s because GPT‑5.5 crams more facts into every answer. More names, more dates, more specifics. Each detail gets a bit better, but there are so many that the odds of something being off don’t really drop.
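The arithmetic behind that effect can be sketched in a few lines. Everything numeric here is hypothetical and chosen only for illustration: the per-claim error rates, the claim counts, and the assumption that claims fail independently are mine, not figures from the system card.

```python
def p_any_error(per_claim_error: float, n_claims: int) -> float:
    """Chance that a response contains at least one wrong claim,
    assuming every claim fails independently of the others."""
    return 1.0 - (1.0 - per_claim_error) ** n_claims


# Hypothetical older model: 5% error per claim, 10 claims per answer.
older = p_any_error(per_claim_error=0.05, n_claims=10)

# Hypothetical newer model: 23% fewer errors per claim, but 13 claims per answer.
newer = p_any_error(per_claim_error=0.05 * 0.77, n_claims=13)

print(f"older model: {older:.1%} of responses contain an error")
print(f"newer model: {newer:.1%} of responses contain an error")
```

With these made-up inputs, the two response-level error rates land within about a point of each other, even though each individual claim got noticeably more reliable. More claims per answer eat most of the per-claim gain.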
Artificial Analysis put it plainly:
This makes it more likely to answer a question when it does not ‘know’ the answer.
For deep research, regulatory references, source claims, and anything else where one fabricated citation kills the research, GPT-5.5 is the wrong tool.
What This Means for Critical AI Literacy
This is where it starts being about us, not the model.
The risk is that we can’t hear the difference.
If the tone stays steady:
right answers sound confident
wrong answers also sound confident
…it’s easy for our brain to decide: this feels reliable.
That’s the moment to take a second look.
Critical AI literacy isn’t about memorizing which model is best this week.
It’s about building a reflex:
Treat fluency as a product design choice made by the people who built the model, not a signal of truth.
Assume that as these systems get more capable, they’ll bet on a design choice that doesn’t slow you down. Remember: faster isn’t better in every scenario.
Verify claims against other sources and your own expertise.
For citation-heavy work, I’m sticking with Claude Opus 4.7.
Its 36% hallucination rate isn’t perfect, but it’s far lower, and the model actually hesitates when it should.
If cost matters, draft with GPT-5.5 and verify in Opus 4.7. AA's own data: GPT-5.5 (medium) scores the same as Claude Opus 4.7 (max) on the Intelligence Index at one quarter of the cost (~$1,200 vs $4,800).
The verification prompt:
Here’s the response you just generated. For every claim with a date, number, name, or quoted text, tell me:
(1) the claim,
(2) a source you can point to,
(3) confidence the source says exactly this.
If you can’t name a source, say so explicitly.
It’s a good habit anyway.
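If you want to wire the draft-then-verify pass into a script, a minimal sketch could look like this. It makes no real API calls: a "model" is any function that takes a prompt and returns text, so you would plug in your own clients. The names `draft_then_verify`, `fake_drafter`, and `fake_verifier` are hypothetical, and the trailing "Response to check:" section is my addition so a second model can see the draft.

```python
from typing import Callable

# The verification prompt from above, plus a slot for the draft (my addition).
VERIFY_TEMPLATE = """Here's the response you just generated. For every claim with a date, number, name, or quoted text, tell me:
(1) the claim,
(2) a source you can point to,
(3) confidence the source says exactly this.
If you can't name a source, say so explicitly.

Response to check:
{draft}"""

# A "model" is any callable that maps a prompt string to a text response.
Model = Callable[[str], str]


def draft_then_verify(question: str, drafter: Model, verifier: Model) -> dict:
    """Get a draft from one model, then ask a second model to audit its claims."""
    draft = drafter(question)
    report = verifier(VERIFY_TEMPLATE.format(draft=draft))
    return {"draft": draft, "verification": report}


# Stand-in models so the sketch runs without any API keys.
def fake_drafter(prompt: str) -> str:
    return f"Draft answer to: {prompt}"


def fake_verifier(prompt: str) -> str:
    return "Claim 1: no source named."


result = draft_then_verify(
    "When did the regulation take effect?", fake_drafter, fake_verifier
)
print(result["verification"])
```

Keeping the models as plain callables means you can swap which one drafts and which one verifies without touching the workflow itself.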
One Footnote on “Opus Is Best”
It isn’t. Not on raw hallucination rate.
Grok 4.20 leads AA’s hallucination leaderboard at 17%.
Opus wins because it pairs low hallucination with high accuracy.
Grok wins on hedging: it abstains more often. Worth knowing if Grok ever ships in your stack.
Bottom Line
GPT-5.5 is the best flagship for coding, reasoning, agentic planning, and 1M-token context work.
It’s the worst for citation generation, regulatory references, and anything where confident-wrong is the failure mode.
Don’t get attached to the brand. Get good at picking the right tool for the task.