What I love about this is that it reframes AI from a tooling problem to a management problem.
Most of the conversations I’m in right now are still focused on prompts, models, and use cases. But what you’re describing shows up much earlier than that.
If “done” isn’t clearly defined, if ownership is unclear, and if too many things are running at once, AI just accelerates the breakdown that was already there.
In transformation work, I see this all the time. Organizations move straight into execution without doing the leadership work of alignment and direction. AI doesn’t fix that. It exposes it.
The interesting shift here is that leaders aren’t just adopting AI. They are being forced to rethink how work is structured, delegated, and verified across the entire system.
That’s where the real opportunity is.
Really appreciate you reading and sharing this, Sherry! That's the core insight for me too. Kacper's framework builds on decades of hard-won lessons in engineering and product management. We don't need to reinvent the wheel, we need to adapt it.
That's exactly it! I'm glad you're seeing similar patterns out there
Great post and key insight from it: AI doesn’t scale without systems. And that makes this more than just an engineering challenge... it’s a management one.
My sense is that “management” itself will need to be redefined, and I’ll be writing more on that soon. What’s working is keeping things simple and modular. Breaking work into smaller chunks and assigning isolated agents to each task is crucial, not just for traceability, but for the quality of the outcome itself.
Absolutely agree that management itself will be changing. I'm curious to read more of your thoughts on that!
Thank you for taking the time to read and comment, Just J 🤗
delegation ladder for AI is such a smart way to build trust
Glad you liked it!
This got my butt back in the chair!
the Define-Deliver-Drive breakdown clicks - most teams jump straight to Drive and skip the spec work, then blame the model when output drifts. curious how you handle the delegation ladder when an agent needs to spawn sub-tasks - does that count as a WIP violation or a separate context?
That's an interesting question actually. I don't think you'd want to maintain that fine-grained control of an agent, partly because it's impractical and partly because it's not like it's trying to do something other than what you asked. So as long as it's staying on task, I don't care if it needs to spin up sub-agents. Sort of like you wouldn't micromanage an engineer on the means of achieving the goal, as long as they stay focused on the goal.
Boris Cherny said something along those lines on Lenny's podcast, I think: that describing in too much detail "how" the model is supposed to achieve something is too limiting and leads to worse results. You need to be clear on the goals and the guardrails, not necessarily the means.
the engineer analogy holds - you scope the goal, not the method.
the edge case I watch for is when sub-tasks start pulling in new context that was not in the original brief. that is where scope creep lives in agentic workflows. the agent is not misbehaving - it just found a gap in the spec.
The delegation ladder maps almost exactly to what I had to build for a marketplace sandbox. Level 1-2 (request/propose) = sandbox phase where the agent can browse and recommend but can't commit. Level 3 (proceed and notify) = graduated progression after the agent has demonstrated it understands the task scope.
The thing that isn't in the framework: what do you do when the agent's task definition is correct but its model of the world is stale? Mine approved a purchase for a product that had been deprecated 6 hours earlier. The escalation trigger fired correctly but the escalation itself hit a dead end (human reviewer didn't have context).
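A rough sketch of how the ladder and the stale-world-model case above might be combined into one gate. The three level names come from the comment; everything else (the `decide` function, the returned action strings, and the one-hour `MAX_SNAPSHOT_AGE`) is a hypothetical illustration, not part of the framework.

```python
from datetime import datetime, timedelta, timezone
from enum import IntEnum
from typing import Optional


class DelegationLevel(IntEnum):
    REQUEST = 1             # agent may browse and ask
    PROPOSE = 2             # agent may recommend, cannot commit
    PROCEED_AND_NOTIFY = 3  # agent may commit, then tell a human


# Assumption: data older than this counts as a stale model of the world.
MAX_SNAPSHOT_AGE = timedelta(hours=1)


def decide(level: DelegationLevel,
           snapshot_taken_at: datetime,
           now: Optional[datetime] = None) -> str:
    """Return the action the agent is allowed to take right now."""
    now = now or datetime.now(timezone.utc)
    if level < DelegationLevel.PROCEED_AND_NOTIFY:
        return "recommend_only"
    if now - snapshot_taken_at > MAX_SNAPSHOT_AGE:
        # Task definition is fine, but the facts it rests on may not be.
        return "refresh_then_retry"
    return "commit_and_notify"
```

With a catalog snapshot taken six hours before the purchase, as in the example above, this gate would return `refresh_then_retry` instead of committing. It does not address the second half of the problem, the reviewer who lacks context when the escalation arrives.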
"If you can't verify it, you haven't defined done" is going on a sticky note.
oh that's a good one. I think the model of the world is one of the rough edges of agents that is not trivially solvable
100% this :)
BTW. Poland, right? :D I will follow, as I seek more people from Europe on Substack!
yes!
Good write up. Like the 5 steps at the end. One thing on my mind is Evals and how to leverage Autosearch for it.
I think evals get hard to perform when we get into complex agentic workflows. They're great if we're just comparing results of single prompts or refining reusable skills though
This is so useful and to the point. Thanks for sharing!
thank you! glad you liked it!
oh this is great! Saving that task brief to test out on my next feature build. The "when to escalate" part is such a good section to add. I feel like Claude is getting a lot better at this, not sure if it's cos I've also gotten better at prompting, but it still feels like a good safeguard to add in and consider for each build.
I've noticed too that Claude Code/Cowork have gotten really "proactive" about asking questions and stopping themselves before going too deep into dangerous waters.
Still, I believe it's good practice to prompt for this, as it reinforces the notion for the model that "hey, there's a human here you need to ask for permission in these specific cases"
This is great!!
"Five Rules That Improve Any AI Agent Workflow in 2026"
1. No task without a definition of done. If you can’t describe what “finished” looks like before the agent starts, the task isn’t ready.
2. One task at a time. Don’t let the agent juggle multiple things at once. Focused work beats scattered work, even when the worker is an AI.
3. Keep deliverables small. Give the agent one small piece to finish, not a massive batch. The bigger the output, the less carefully you’ll check it.
4. Always verify before accepting. Use checklists, spot checks, or human review, especially for high-stakes work. Verification isn’t something you add after. It’s built into your definition of done.
5. Set clear escalation triggers. Before the task runs, decide: at what point should the agent stop and ask you instead of continuing on its own? Write it in the brief.
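One way to make these rules checkable before an agent starts is a small brief object that refuses to count as ready until the gaps are filled. This is only a sketch: `TaskBrief` and its field names are hypothetical and not taken from the post, and the "one task at a time" rule is left to whatever schedules the briefs.

```python
from dataclasses import dataclass, field


@dataclass
class TaskBrief:
    """Hypothetical task brief; field names are illustrative only."""
    goal: str
    # Verifiable checks that together define "done" (rules on done and verification).
    done_criteria: list[str] = field(default_factory=list)
    # Conditions under which the agent must stop and ask (escalation rule).
    escalation_triggers: list[str] = field(default_factory=list)
    # Number of deliverables requested in this brief (small-deliverables rule).
    deliverables: int = 1

    def problems(self) -> list[str]:
        """Return the reasons this brief is not ready to hand to an agent."""
        issues = []
        if not self.goal.strip():
            issues.append("missing goal")
        if not self.done_criteria:
            issues.append("no definition of done")
        if not self.escalation_triggers:
            issues.append("no escalation triggers")
        if self.deliverables > 1:
            issues.append("deliverable too large; split the task")
        return issues

    def is_ready(self) -> bool:
        return not self.problems()
```

A brief with only a goal, such as `TaskBrief(goal="Add CSV export")`, reports both a missing definition of done and missing escalation triggers, which is the point: the task is not ready yet.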
and the funny thing is, all of these map pretty well to humans as well :)
Good systems make average tools work better.
that's the thing: if you put the tool in the right context and "set it up for success", you can really maximize its strengths even if it's not ideal
only speeds up a bad process, great tool but must still be managed well
Yes! AI is a multiplier: it multiplies bad processes as well as good ones.
The Define-Deliver-Drive framework maps perfectly to physical AI fleet management. In IoT deployments, we manage hundreds of edge AI devices the same way you'd manage an engineering team: each device needs clear decision rights (what it can act on autonomously vs. escalate), WIP limits (how many inference tasks run concurrently given thermal and bandwidth constraints), and observability (real-time health monitoring via eSIM-based connectivity). The failure mode of 'no decision rights = constant human bottleneck' is exactly what happens when edge devices lack OTA governance and require manual intervention for every model update. Autonomous agents need autonomous infrastructure.
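The WIP-limit idea in this comment can be illustrated with a tiny non-blocking limiter: a device (or agent) may start a new inference task only if a slot is free. The class name `WipLimiter` and the limit value are hypothetical; how a real fleet derives the limit from thermal or bandwidth constraints is not specified in the comment.

```python
import threading


class WipLimiter:
    """Hypothetical cap on how many tasks may be in progress at once."""

    def __init__(self, limit: int) -> None:
        self._slots = threading.BoundedSemaphore(limit)

    def try_start(self) -> bool:
        """Claim a slot if one is free; never blocks."""
        return self._slots.acquire(blocking=False)

    def finish(self) -> None:
        """Give the slot back once the task is done."""
        self._slots.release()
```

With `WipLimiter(2)`, the third `try_start()` returns `False` until one of the running tasks calls `finish()`, so extra work is deferred or escalated rather than piled on.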
This is a great extension. The isomorphism between software agents and physical edge devices is almost perfect. OTA governance is exactly the same problem as deploying a model update mid-task. The infrastructure has to be as autonomous as the thing it's running. What does your observability stack look like for edge device health?
fair point - the impracticality of tracking agent sub-tasks at WIP granularity is real. what I run into is less about control and more about recovery: when an agent goes off-track on a sub-task, knowing where it started drifting matters for the postmortem. not WIP tracking, more like a breadcrumb policy.
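A breadcrumb policy like this could be as light as recording, per sub-task, which context the agent touched, then scanning the trail for the first step that reached outside the original brief. The sketch below assumes context can be named by simple string keys; `Breadcrumb` and `first_drift` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Breadcrumb:
    step: int                     # order in which the sub-task started
    subtask: str                  # short description of the sub-task
    context_used: frozenset[str]  # context keys the sub-task pulled in


def first_drift(brief_context: set[str],
                trail: list[Breadcrumb]) -> Optional[Breadcrumb]:
    """Return the earliest breadcrumb that used context not in the brief."""
    for crumb in sorted(trail, key=lambda c: c.step):
        if not crumb.context_used <= brief_context:
            return crumb
    return None
```

For a postmortem, the returned breadcrumb marks where the agent first found a gap in the spec, which is usually a more useful starting point than the final bad output.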
https://youtu.be/l5oG8xsDHeA?si=Bc0LgmcZ9OvNvwRa
This is very interesting! Haven't heard about it at all before
Love this. You can vibe code if you want to move fast.
But you can’t vibe manage AI agents and expect good work.
That’s the same problem you’d have with a human, which is something I wrote about too.
🔗 https://millennialmasters.net/p/ai-tools-management
Ooh, I think your article should be read together with ours, great take on the topic!