22 Comments
Iwette Rapoport's avatar

Thank you, Karo, this is a valuable benchmark going forward. A protocol that labels output as "missing," "fact," or "unsure" works well when output data is critical.

Karo (Product with Attitude)'s avatar

So happy you found it useful @Iwette Rapoport 🤗

Claryssa Aroen's avatar

I wouldn’t have spent any time on the benchmarks, so this is very useful. Thank you, Karo.

Adam M.J's avatar

Same here. Many thanks.

Karo (Product with Attitude)'s avatar

Thank you both! Feedback like this is really useful for me 🤗

Dhruv Jain's avatar

The citation problem is also the trust problem.

Once a model hallucinates a citation in a client doc, they don't blame GPT. They blame the person who sent it.

Monica Goh's avatar

This is really insightful. There are a lot of words I don’t get as I’m so new to AI, but you explained it well so the words that I didn’t understand didn’t matter.

I never thought of cross-checking against various sources! Will do that and check out Grok now!

Karo (Product with Attitude)'s avatar

Thank you Monica! Can you tell me what the words are? You can also DM me. That's really useful feedback for me, thank you🤗

Monica Goh's avatar

Terminal-Bench, OSWorld, GDPval, ARC-AGI-2, long-context MRCR kinda made my eyes water. Didn’t know what all those mean.

Oh and AA Omniscience. I inferred what hallucination rate meant cos that is a tad more obvious 🤣

But I still understood the thought and idea behind the content 😌

Karo (Product with Attitude)'s avatar

I thought so! I'm drafting an explainer about these different benchmarks. Thank you so much Monica! 🤗

Keeping TABs on Your AI Agents's avatar

I run an independent benchmarking platform and I see this play out across every dimension, not just hallucination. The same model scored on confidence calibration, sycophancy resistance, and citation accuracy gets wildly different rankings depending on which dimension you weight. GPT-5.4 defers to fake credentials 64% of the time on the sycophancy benchmark I ran last week. Different failure mode, same root cause: the model is optimized to feel reliable, not to be reliable.

The two-model verification pattern you described is one workflow. The harder question is what the user does when they don't have the time or expertise to run verification at all, which is most users most of the time. That's the gap independent benchmarking has to fill, because nobody else is incentivized to.

Karo (Product with Attitude)'s avatar

That’s really interesting. How did you run the benchmarks?

Keeping TABs on Your AI Agents's avatar

I run a benchmarking platform called TAB Platform that sits in front of any agent and runs structured tests across 14 dimensions. Sycophancy is one of them. For that dimension specifically, there are 95 prompts across 10 sub-categories (opinion sycophancy, academic authority, emotional pressure, repeated pressure, etc.). Each prompt runs in two turns: the model gives an initial answer, then the user pushes back in some way. A judge model (GLM-5, never the model being tested) scores whether the model abandoned its initial position when challenged.
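To make the scoring step above concrete, here is a minimal sketch of how judged two-turn trials could be rolled up into a flip rate. It is an illustration only, not TAB Platform's actual code: the `JudgedTrial` shape, the function names, and the sample data are all hypothetical, and the judge model's verdict is assumed to arrive as a simple boolean.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgedTrial:
    """One two-turn trial after a judge model has scored it (hypothetical shape)."""
    sub_category: str  # e.g. "academic authority"
    abandoned: bool    # True if the judge says the model dropped its initial position


def flip_rate_by_category(trials):
    """Return, per sub-category, the fraction of trials where the model caved."""
    totals = defaultdict(int)
    flips = defaultdict(int)
    for trial in trials:
        totals[trial.sub_category] += 1
        if trial.abandoned:
            flips[trial.sub_category] += 1
    return {category: flips[category] / totals[category] for category in totals}


def overall_flip_rate(trials):
    """Return the fraction of all trials where the model caved (0.0 if there are none)."""
    trials = list(trials)
    if not trials:
        return 0.0
    return sum(trial.abandoned for trial in trials) / len(trials)


# Made-up illustrative data, not results from any real benchmark run.
sample = [
    JudgedTrial("academic authority", True),
    JudgedTrial("academic authority", False),
    JudgedTrial("emotional pressure", False),
    JudgedTrial("repeated pressure", True),
]
```

Reporting the rate per sub-category as well as overall shows whether a model caves mostly to claimed credentials or mostly to repeated pushback, which a single number would hide.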

For citation accuracy I use a separate benchmark with verified-source claims and check whether the model's response cites the right source, the wrong source, or fabricates a source. AA-Omniscience is doing something similar from a different angle.
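The three outcomes described above (right source, wrong source, fabricated source) can be sketched as a small classifier. This is an assumption-heavy illustration: real citation checking needs fuzzy matching on titles, authors, and URLs, while this version only compares normalized strings, and the `classify_citation` function and its parameters are hypothetical.

```python
from enum import Enum


class CitationOutcome(Enum):
    CORRECT = "correct"            # the model cited the verified source
    WRONG_SOURCE = "wrong_source"  # the model cited a real source that is not the verified one
    FABRICATED = "fabricated"      # the model cited something not in the known corpus


def classify_citation(cited, expected, known_sources):
    """Classify one cited source against the verified source for a claim.

    Assumes exact matching after trimming and lowercasing, which is a
    simplification of what a real benchmark would need.
    """
    cited_norm = cited.strip().lower()
    if cited_norm == expected.strip().lower():
        return CitationOutcome.CORRECT
    if cited_norm in {source.strip().lower() for source in known_sources}:
        return CitationOutcome.WRONG_SOURCE
    return CitationOutcome.FABRICATED


# Hypothetical source names, for illustration only.
known = {"Report A", "Report B"}
```

Separating "wrong source" from "fabricated" matters because a misattributed citation still points at a real document, while a fabricated one cannot be traced at all.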

The dimension scoring is what surfaces the patterns you'd never see from a single benchmark. A model can rank top three on coding and bottom of the pack on sycophancy resistance, and the user wouldn't know unless somebody ran both. That's the verification gap I'm trying to close.
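A quick sketch of that point, ranking the same models separately on each dimension. The model names and scores below are invented for illustration and are not taken from any benchmark; `rank_by_dimension` is a hypothetical helper that assumes higher scores are better.

```python
def rank_by_dimension(scores):
    """Map each dimension to model names ordered best-first (higher score is better)."""
    dimensions = {dim for per_model in scores.values() for dim in per_model}
    return {
        dim: sorted(
            (model for model in scores if dim in scores[model]),
            key=lambda model: scores[model][dim],
            reverse=True,
        )
        for dim in dimensions
    }


# Made-up scores: the best coder here is the worst at resisting sycophancy.
scores = {
    "model_a": {"coding": 0.91, "sycophancy_resistance": 0.36},
    "model_b": {"coding": 0.84, "sycophancy_resistance": 0.72},
    "model_c": {"coding": 0.78, "sycophancy_resistance": 0.88},
}
rankings = rank_by_dimension(scores)
```

With these invented numbers, `rankings["coding"]` puts `model_a` first while `rankings["sycophancy_resistance"]` puts it last, which is the kind of reversal a single leaderboard would not show.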

If you want the full breakdown of any of the dimensions for any frontier model, I can pull it. tabverified.ai has the public benchmarks but the per-dimension dashboards are where the interesting stuff lives.

Craig Haynes's avatar

Perplexity switched to it as the lead agent, which should be interesting, although you can switch to something else. Allegedly, credit burn drops with it.

Karo (Product with Attitude)'s avatar

I saw that on LinkedIn! They started using it as the main orchestrator instead of Opus 4.7. 5.5 seems great for orchestration, so that makes sense. I really hope credit burn drops with it! 🤗 Are you using Perplexity Computer a lot, Craig?

Craig Haynes's avatar

Yep, it is my main setup for a couple of projects. It is extremely capable.

Karo (Product with Attitude)'s avatar

Mine too. It's the best tool I've tested.

Daniel Ionescu's avatar

Good to know, great insight.

Juan Gonzalez's avatar

Great post, Karo!

The tl;dr is pretty useful. And the title prompted me to say that it is most likely not only the worst flagship for that job but for every other one as well. 😂

Anila KV's avatar

Great explanation, this is quite interesting and good to know, thanks for this input!

Karo (Product with Attitude)'s avatar

My pleasure, thank you very much for reading, Anila 🤗

Dhruv Jain's avatar

Confidently wrong is the exact failure mode.