GPT-5.5 outperforms and hallucinates
- Anastasia Karavdina
- 4 days ago
- 2 min read
OpenAI’s latest flagship model seems to tell two stories at once.
On one hand, it pushes the frontier again, setting new state-of-the-art results across important benchmarks for knowledge work, agentic coding, computer use, and abstract visual reasoning.
On the other hand, it appears to struggle with one of the most important skills for real-world AI systems: knowing when not to answer.
GPT-5.5 is clearly powerful.
It tops the Artificial Analysis Intelligence Index, performs extremely well on ARC-AGI-2, delivers strong results on command-line workflows, autonomous computer-use tasks, and multi-turn customer-service simulations, and shows that OpenAI is still very much competing at the frontier of model capability.
But the more interesting part is not only what GPT-5.5 can do.
It is what it does when it is wrong.
According to the reported benchmark results, GPT-5.5 often knows more than its peers, but it is also more likely to answer incorrectly instead of admitting uncertainty.
That creates a very different risk profile for teams deploying it in production systems, where confidence calibration, reliability, and escalation behavior matter as much as raw performance.
That distinction is critical because benchmark leadership can make a model look like the obvious choice, while day-to-day usage may reveal a different picture: a model that is brilliant most of the time, but overconfident in exactly the moments where a cautious “I don’t know” would be far more valuable.
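A quick way to see why this matters is to score answers with a penalty for confident errors, so that abstaining outscores guessing wrong. The sketch below uses entirely hypothetical numbers and a made-up penalty weight; it is not how any published benchmark scores GPT-5.5, just an illustration of how the ranking can flip.

```python
# Minimal sketch (hypothetical numbers): penalize wrong answers so that
# abstaining ("I don't know") scores better than a confident mistake.

def penalized_score(correct: int, wrong: int, abstained: int,
                    wrong_penalty: float = 2.0) -> float:
    """+1 per correct answer, 0 per abstention, -penalty per wrong answer."""
    return correct * 1.0 + abstained * 0.0 - wrong * wrong_penalty

# Over 100 hypothetical questions:
# Model A answers everything: 80 correct, 20 wrong, never abstains.
# Model B knows less but abstains when unsure: 72 correct, 4 wrong, 24 abstentions.
model_a = penalized_score(correct=80, wrong=20, abstained=0)   # 80 - 40 = 40.0
model_b = penalized_score(correct=72, wrong=4, abstained=24)   # 72 - 8  = 64.0

print(f"Model A (answers everything):   {model_a}")
print(f"Model B (abstains when unsure): {model_b}")
```

Under plain accuracy, Model A wins 80 to 72; once wrong answers carry a cost, as they do in most production systems, the cautious model comes out ahead.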
This is exactly where objective benchmarks and human preference evaluations seem to diverge.
On capability-focused tests, GPT-5.5 looks extremely strong.
On blind head-to-head comparisons and subjective evaluations, however, competitors such as the Claude Opus models appear to perform better in several categories.
What a model can theoretically accomplish and what it feels like to work with in practice are not always the same thing.
For developers, product teams, and AI leaders, the lesson is not simply “use the model with the highest score.”
The lesson is to evaluate models across multiple dimensions: capability, cost, latency, hallucination behavior, tool-use reliability, refusal quality, user experience, security implications, and the ability to admit uncertainty when the task requires it.
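One lightweight way to make that multi-dimensional evaluation concrete is a weighted scorecard. Everything in the sketch below is a placeholder assumption: the dimension names mirror the list above, but the weights and per-model scores are invented, and in practice each score would come from your own evals rather than hard-coded numbers.

```python
# Minimal sketch of a multi-dimensional model scorecard.
# All weights and scores are hypothetical placeholders; the point is that
# the "best" model depends on which dimensions your workflow weights.

WEIGHTS = {                      # tune per workflow and risk profile
    "capability": 0.25,
    "cost": 0.15,
    "latency": 0.10,
    "hallucination_resistance": 0.25,
    "tool_use_reliability": 0.15,
    "refusal_quality": 0.10,
}

# Hypothetical 0-10 scores per model on each dimension.
MODELS = {
    "model_x": {"capability": 9, "cost": 5, "latency": 6,
                "hallucination_resistance": 4, "tool_use_reliability": 8,
                "refusal_quality": 5},
    "model_y": {"capability": 7, "cost": 7, "latency": 7,
                "hallucination_resistance": 8, "tool_use_reliability": 7,
                "refusal_quality": 8},
}

def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-dimension scores using the workflow's weights."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

for name, scores in MODELS.items():
    print(f"{name}: {weighted_score(scores):.2f}")
# model_x: 6.30, model_y: 7.35 -> the benchmark leader on raw capability
# is not the winner once hallucination behavior carries real weight.
```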
GPT-5.5 may be a major step forward, but it also reinforces a pattern we keep seeing at the frontier: the best model on one benchmark is not automatically the best model for every workflow, every user, or every risk profile.
Keep in mind that model choice is not a one-time architectural decision.
It is an ongoing product, engineering, and risk-management decision, and one worth revisiting regularly.
