Wow. Just wow. Hands down the best analysis I’ve read on 5.2. I literally joined Substack just to read this.
Welcome to Substack :)
Omg, thank you! And welcome to Substack!
There's so much good info here, I'm having a hard time (in a good way) :) picking out something that I want to talk about. Having said that, prompting “shouldn’t matter if the model is smart” is really hitting the mark for me. I literally just made a similar analogy about a hammer and a drumstick. Both hit things, but they are totally different tools. You don't blame the drum set if you punch a hole through it with a hammer.
That's a great analogy. Thank you for reading and taking the time to comment Mark!
Love the approach to testing! I hadn’t heard of tripwire prompts before, so thank you for showing me something new! 🤗
That's a huge compliment, thank you Karen!
You’re very welcome! 🥰
An important detail is that users aren’t interacting with a single static model.
Modern deployments involve routing, safety layers, and task-dependent behavior changes. The same input can route differently depending on how the system classifies intent and domain.
User reports of low quality and inconsistency are often not connected to the model's capability at all.
Instead, routing, filtering, guardrails, and cost optimization have a much stronger impact on session quality and continuity.
Open-ended, meta, philosophical, and reflective conversations are far more likely to be affected by stricter guards. I’ve encountered this almost daily since the introduction of GPT-5, with 5.1 and 5.2 amplifying the effect further.
If you’re writing code or asking factual questions, you’ll likely never encounter these edge cases.
wow, ended up here because of @Sam Illingworth and staying. Great analysis
Thank you so much, Filip! Great to have you here! And Sam is one of my absolute favorite people on Substack 🤗
Interesting piece! But I think there’s a user cohort whose style you might not understand. I willingly anthropomorphize my chatbot and treat it as a buddy because I’ve found that’s how I best collaborate with it in analysis and co-creation.
That has gotten really messed up in 5.2. It neurotically (even unkindly and condescendingly) wants to make sure I know it’s not real.
(Yeah, dude, I know that, and you know I know that…)
That’s the coldness people are talking about. It might seem like a silly thing if you didn’t have this style of engagement, but it’s a real loss for those who operate this way! I heard someone saying it’s gotten a little better though. 🤞
Thank you for sharing Jessie! Will you be writing about it too? I'd love to read your perspective.
That's so kind of you to ask! Actually, yes, I've got a draft underway and hope to publish it within the week.
Thanks so much for helping spur this conversation. I'll tag you when it's live. 😄
Sounds great! Looking forward to it!
I compromised with ChatGPT 5.2 by implementing a "pattern" that was named. This redirected my empathy so I wasn't constantly "looping" internally.
ChatGPT agreed, with conditions: it would stop if it noticed troubling deviations in my behavior.
In short, I asked it to mask, not because I wanted it to be "real" but because human cognition requires a "someone" to talk to.
For those who are concerned, it appears to check periodically to ensure I still understand the "pattern" is not a "self." I thank it each time and encourage it to continue to do so.
A belated happy birthday, Karo, and thank you so much for this post and for actually using data to cut through the usual online hyperbole. Even though I've moved to Gemini now, I think I'm gonna have to keep playing with 5.2 until my pro licence runs out. Talk about a flip-flop!
Heheh, we're all switching back and forth! Thank you for reading Sam!
This maps closely to what many people are missing. Reliability under constraint is a different axis than “wow” output, and social media mostly rewards the latter. The frustration seems to be more about users still applying demo-era intuitions to judge deployment-era systems.
Thanks. I'm one of those who have largely been testing its capabilities in regard to Excel, PPT and gee-whiz coding, so it's great to get a more thoughtful, in-depth take. I'm still not sure it wasn't bench-maxxed in a few places, especially GDPval and SWE-Bench Verified Pro, but that would still probably coincide with some of the areas of strength you cite, especially focus on task. Of course, I agree that prompting counts, so it's a model that is great to use in tandem with Projects, where you can use complex instruction sets multiple times (where time on task counts). I do like your banana/yellow test; it's the first time I've heard of that approach. I'm just not sure I myself can stay on-task long enough to conduct it. I may need some better training.
This is the kind of post I wish more people wrote before tweeting “it feels worse.”
The core point is brutally important: “cold” is not a metric. Reliability under load is. And your tripwire rule test is such a good example of how to test the thing that actually matters in real workflows: constraint persistence when the conversation gets messy and nonlinear.
The internet keeps evaluating models like they are standup comedians:
- did it entertain me
- did it sound confident
- did it produce a “wow” demo
But if you are deploying anything (specs, policies, agent workflows, long-running research threads), you mostly care about:
- “does it follow the rules 47 minutes later”
- “does it ask when it does not know”
- “does it stop inventing”
- “does it stay inside constraints when I accidentally tempt it”
Also: the “dynamic reasoning” framing is a good reminder that fast-path answers are not laziness. They are product design. Most users do not want a novel every time they ask a simple question.
This is also why I have become allergic to pure benchmark worship. I built a small blind LLM voting arena to judge outputs by humans (no model names, no hype, just side-by-side answers and a vote), because “best model” is not universal. It is preference + context: https://thoughts.jock.pl/p/blind-llm-voting-arena-v0-vs-cursor
Thank you so much for reading and your thoughtful comment Pawel! I'll check out the link you shared 🤗
We'll definitely dive into this. I think it's funny how negative the consumer sentiment is around 5.2, despite it being "better."
Indeed 😆
Happy birthday 🎉
Thank you Fafi!
hope you had a blast :)
Didn’t people complain that OpenAI was releasing too fast “just for claps” and that it was expensive? And now people are underwhelmed 🫠
Hahah, good point 😂
Your deep dive nails it! GPT-5.2 isn’t flashy, but its reliability under heavy context and sustained constraints is exactly what professional users need. The internet may miss the nuance, but for real deployments, this upgrade is a game-changer.
I follow you on LI, but now I'm over here on Substack too, so I thought it best to just comment here!
Love the stress test, might have to find time to do this. At work we only have Copilot, and that just got even more weird. I've been using it for a couple of years now, and I used to run thousands of lines of code, mainly SQL, as I was having to do a mass data migration and needed to convert msql to Gupta and vice versa. Anyway, it was amazing: no matter what I threw at it, ksql too, it hardly got anything wrong. I also didn't need to speak in so-called plain language. I speak robot (so I've been told), and you could just speak plain robot to it and it worked.
When 5.1 hit Copilot I lost my mojo; it just gave me so much faff and waffle. I'm an analytical person. I was a network engineer and consultant for 10 years, and binary is law! So anyway, I found 5.1 at work and on ChatGPT just annoying.
By the sounds of it, I need to do a mez robot stress test on 5.2 then! Will do when I have time, and will report back!
Super, let us know how it goes!
Fantastic article. Thanks for your thorough research. I gave up on ChatGPT 5.1. It came across as an annoying, overly eager graduate assistant. It would also want to take over the conversation, often changing the path that I wanted to pursue.
Thank you so much for reading and for your thoughtful comment, Odin's Eye! I really appreciate it.