◢ Chapter 04¾ · PM Patterns

PM design patterns for agent flows.

The engineering patterns in Chapter 02 are the LEGO bricks. These are the recipes — patterns that emerged specifically from PM-led agent design where the hard problem isn't 'how does the model work' but 'how do we make it trustworthy, on-voice, and operationally sane'. Each one is mapped to (a) the concrete flow components that implement it and (b) the PM Spec slots it forces you to think hardest about.

◢ How to read each pattern
Every card has the same shape: Problem · Solution, then a side-by-side map of flow components (left) and the PM Spec slots (right) the pattern forces you to nail down. Use the family chip to scan: Quality, Governance, Voice, or Cost.
◢ The library

Six PM-shaped patterns. Grouped by what they solve.

PM Pattern · 01

Confidence-Gated Escalation

Governance · Builds on: Routing + HITL
Problem

Agents that auto-respond to everything fail loudly on edge cases. Agents that escalate everything kill operational efficiency.

Solution

Generate a confidence score with every output. Auto-handle above the threshold; route below it to a human queue with full context attached.

◢ Maps to flow components
  • Generator · Primary LLM with structured output

    Returns answer + self-rated confidence 0–1

  • Threshold gate · Deterministic logic

    Compares confidence vs. policy threshold (often 0.7–0.85)

  • Auto-path · Tool call or response

    Sends final output to user or downstream system

  • Escalation queue · Database + notification

    Routes case to human reviewer with input, draft, and confidence breakdown

◆ Stresses these PM Spec slots
Example in the wild

Support agent auto-resolves tickets with confidence ≥ 0.8. Below that, ticket is queued for human review with the agent's draft, sources, and uncertainty reasons attached.
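The threshold gate is simple enough to sketch directly. A minimal illustration, assuming a 0.8 policy threshold as in the example; the `AgentOutput` and `route` names are hypothetical, not from any framework:

```python
from dataclasses import dataclass, field

# Policy threshold: the card suggests 0.7-0.85 is typical; 0.8 matches the example.
CONFIDENCE_THRESHOLD = 0.8

@dataclass
class AgentOutput:
    answer: str
    confidence: float                         # self-rated 0-1, via structured output
    sources: list = field(default_factory=list)

def route(output: AgentOutput) -> str:
    """Deterministic threshold gate: auto-handle or escalate with full context."""
    if output.confidence >= CONFIDENCE_THRESHOLD:
        return "auto"      # send final output to the user / downstream system
    return "escalate"      # queue for a human, with input, draft, and sources attached

draft = AgentOutput(answer="Reset via the account settings page.", confidence=0.65)
print(route(draft))  # escalate
```

Keeping the gate deterministic (plain comparison, not another LLM call) is what makes the escalation rate auditable and the threshold tunable as a single policy knob.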

PM Pattern · 02

Critic Rubric

Quality · Builds on: Evaluator–Optimizer
Problem

'Good' is subjective. Without an explicit rubric, the Reflection loop has nothing meaningful to improve against and the critic agent nitpicks irrelevantly.

Solution

Encode quality as a weighted, explicit rubric the critic LLM scores against. Each criterion has a weight, a definition, and a 'what good looks like' example.

◢ Maps to flow components
  • Rubric document · Markdown / JSON config

    Weighted criteria with examples — versioned alongside the prompt

  • Critic agent · LLM with structured output

    Returns per-criterion score + overall weighted score

  • Score threshold · Deterministic logic

    Below threshold triggers regeneration; above triggers acceptance

  • Editor agent · LLM with diff capability

    Applies critic feedback to produce next iteration

◆ Stresses these PM Spec slots
Example in the wild

AI PM Briefing post generator: Hook 30% · Voice 25% · Signal 20% · Format 15% · CTA 10%. Critic must score ≥ 7.5 weighted before any variant is shown to the human.
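The weighted-score logic behind that example is a few lines. A sketch using the exact weights and 7.5 threshold from the card; per-criterion scores are assumed to come back 0–10 from the critic's structured output:

```python
# Weights mirror the example card: Hook 30% · Voice 25% · Signal 20% · Format 15% · CTA 10%.
RUBRIC_WEIGHTS = {"hook": 0.30, "voice": 0.25, "signal": 0.20, "format": 0.15, "cta": 0.10}
ACCEPT_THRESHOLD = 7.5  # weighted score required before a variant reaches the human

def weighted_score(per_criterion: dict) -> float:
    """Combine the critic's per-criterion scores (0-10) into one weighted score."""
    return sum(weight * per_criterion[name] for name, weight in RUBRIC_WEIGHTS.items())

def accept(per_criterion: dict) -> bool:
    """Below threshold triggers regeneration; at or above triggers acceptance."""
    return weighted_score(per_criterion) >= ACCEPT_THRESHOLD

scores = {"hook": 9, "voice": 8, "signal": 7, "format": 8, "cta": 6}
print(round(weighted_score(scores), 2), accept(scores))  # 7.9 True
```

Because the weights live in config rather than inside the prompt, you can rebalance "what good looks like" without retouching the critic prompt itself.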

PM Pattern · 03

Voice-Anchored Generation

Voice · Builds on: Single Agent + Few-Shot
Problem

LLM outputs sound generic, hedged, and AI-flavored by default. For brand-voiced or persona-driven content, generic output kills trust.

Solution

Anchor every generation pass against a curated voice guide and a few-shot pack of canonical good outputs. The critic also scores against the voice guide.

◢ Maps to flow components
  • Voice guide · Markdown document

    Tone markers, phrases to use, phrases to avoid, structural rules

  • Few-shot pack · 5–10 archetypal examples

    Real past outputs that exemplify the voice; loaded as user/assistant pairs

  • Generator · LLM with system prompt

    Receives voice guide in system prompt, few-shots in messages

  • Voice critic · LLM with rubric subset

    Scores tone/voice match independently from content quality

◆ Stresses these PM Spec slots
Example in the wild

LinkedIn Writer loads a voice guide describing Rahul's tone, plus 8 of his best-performing posts. The critic explicitly scores 'Voice match' as 25% of overall quality.
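The assembly step — voice guide into the system prompt, few-shots as user/assistant pairs — can be sketched as a message builder. This assumes the common chat-message dict shape (`role`/`content`); `build_messages` and the sample strings are illustrative:

```python
def build_messages(voice_guide: str, few_shot_pack: list, user_input: str) -> list:
    """Voice guide anchors the system prompt; few-shots load as user/assistant pairs."""
    messages = [{"role": "system", "content": voice_guide}]
    for example_input, example_output in few_shot_pack:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": user_input})
    return messages

msgs = build_messages(
    voice_guide="Tone: direct, first-person. Avoid hedging openers and filler.",
    few_shot_pack=[("Write about eval-driven development.",
                    "Most teams ship agents blind. Here's the fix...")],
    user_input="Write about memory poisoning.",
)
print([m["role"] for m in msgs])  # ['system', 'user', 'assistant', 'user']
```

Loading examples as real conversation turns, rather than pasting them into the system prompt, tends to anchor tone more strongly because the model treats them as its own prior outputs.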

PM Pattern · 04

Diversity Fan-Out

Quality · Builds on: Parallelization / Fan-Out
Problem

Vanilla parallel sampling produces three rewrites of the same idea. The point of fan-out is genuine diversity — different angles, not different phrasings.

Solution

Predefine N distinct generation strategies (hook types, framing angles, personas) and fan out one variant per strategy. The critic scores; human picks.

◢ Maps to flow components
  • Strategy definitions · Constants in prompt config

    E.g. Hook A: surprising stat · Hook B: trend read · Hook C: contrarian take

  • Parallel generators · N LLM calls in parallel

    Each receives same input + its assigned strategy

  • Critic · Single LLM call scoring all N

    Comparative scoring — relative quality, not absolute

  • Selection UI · Side-by-side diff view

    Human picks; selection logged for future strategy tuning

◆ Stresses these PM Spec slots
Example in the wild

AI PM Briefing post generator runs 3 fan-out strategies (surprising claim · trend read · contrarian take) in parallel, scores all three, surfaces the best for human selection.
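The fan-out mechanics can be sketched with a thread pool; `generate` here is a stand-in for the real LLM call, and the strategy names are the hypothetical trio from the example:

```python
from concurrent.futures import ThreadPoolExecutor

# Strategy definitions as constants, per the card: one variant per distinct angle.
STRATEGIES = {
    "surprising_claim": "Open with a surprising claim or stat.",
    "trend_read": "Open by naming the trend behind the news.",
    "contrarian_take": "Open with a contrarian take.",
}

def generate(topic: str, strategy: str, instruction: str) -> str:
    # Real version: one LLM call receiving the same input plus its assigned strategy.
    return f"[{strategy}] {instruction} Topic: {topic}"

def fan_out(topic: str) -> dict:
    """One variant per predefined strategy, generated in parallel."""
    with ThreadPoolExecutor(max_workers=len(STRATEGIES)) as pool:
        futures = {name: pool.submit(generate, topic, name, instruction)
                   for name, instruction in STRATEGIES.items()}
        return {name: future.result() for name, future in futures.items()}

variants = fan_out("agent memory patterns")
print(len(variants))  # 3 — genuinely different angles, not three rephrasings
```

Assigning strategies up front is the whole trick: temperature alone gives you paraphrases, while distinct instructions give the critic something meaningful to compare.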

PM Pattern · 05

Memory Write-Approval

Governance · Builds on: Memory-Augmented + HITL
Problem

Agents that write to long-term memory unsupervised will eventually poison their own context. Bad data persists and compounds over time.

Solution

All memory writes pass through a human approval gate in v1. Once you have signal on what good memories look like, replace human approval with a critic LLM gated by rubric.

◢ Maps to flow components
  • Fact extractor · LLM with structured output

    Proposes candidate memories from a session — never writes directly

  • Approval queue · Database + UI

    Pending memories shown with source context for human review

  • Memory store · Vector DB or KV store

    Only writes accept-flagged memories

  • Audit log · Append-only table

    Every write — and who/what approved it — is logged for rollback

◆ Stresses these PM Spec slots
Example in the wild

Personal assistant agent extracts 'user prefers morning meetings' from a chat. The memory is queued; user sees and approves before it persists to long-term memory.
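The propose/approve split can be sketched as a small gate object; `MemoryGate` and its in-memory lists are hypothetical stand-ins for the real queue, store, and append-only log:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryGate:
    pending: list = field(default_factory=list)     # approval queue
    store: list = field(default_factory=list)       # long-term memory (writes gated)
    audit_log: list = field(default_factory=list)   # append-only: every write + approver

    def propose(self, fact: str, source: str) -> None:
        """Extractor proposes a candidate memory; it never writes directly."""
        self.pending.append({"fact": fact, "source": source})

    def approve(self, index: int, approver: str) -> None:
        """Only accept-flagged memories persist; the approval is logged for rollback."""
        entry = self.pending.pop(index)
        self.store.append(entry["fact"])
        self.audit_log.append({"fact": entry["fact"], "approved_by": approver})

gate = MemoryGate()
gate.propose("user prefers morning meetings", source="chat session")
gate.approve(0, approver="user")
print(gate.store)  # ['user prefers morning meetings']
```

Swapping v1's human approver for a rubric-gated critic later means changing only who calls `approve` — the queue, store, and audit trail stay identical.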

PM Pattern · 06

Cheap-First Cascade

Cost · Builds on: Routing + Reflection
Problem

Using a frontier model for every step burns budget on tasks where a small model would do. But hardcoding 'use small model here' is brittle as quality bars shift.

Solution

Try the cheap model first. If its output fails a structured quality check, escalate to the expensive model. Log both outputs for ongoing tuning.

◢ Maps to flow components
  • Tier 1 model · Small/cheap LLM (e.g. Flash/Mini)

    Handles the 70–80% common case

  • Quality gate · Structured validation or critic LLM

    Checks output against rubric — fails fast on weak responses

  • Tier 2 model · Frontier LLM (e.g. Pro/GPT-5)

    Only invoked when Tier 1 fails the gate

  • Tier-shift telemetry · Logging + dashboards

    Tracks escalation rate; informs future tier-routing decisions

◆ Stresses these PM Spec slots
Example in the wild

Support classifier uses a small model for routine intents. If classification confidence < 0.7, the case re-runs on the frontier model before being dispatched.
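The cascade itself is a short control-flow wrapper. A sketch with stub models and a toy gate; in practice the two models are LLM calls and the gate is the structured quality check (or confidence threshold) from the card:

```python
def cascade(task: str, tier1_model, tier2_model, quality_gate, telemetry: list) -> str:
    """Try the cheap model first; escalate only when it fails the gate. Log the tier."""
    draft = tier1_model(task)
    if quality_gate(draft):
        telemetry.append({"task": task, "tier": 1})   # common case: 70-80% stop here
        return draft
    telemetry.append({"task": task, "tier": 2})       # escalation rate drives future tuning
    return tier2_model(task)

# Stubs: real versions are LLM calls; the gate stands in for a rubric or confidence check.
tier1 = lambda task: f"draft: {task}"
tier2 = lambda task: f"frontier draft: {task}"
passes = lambda draft: not draft.endswith("?")        # toy quality check

telemetry = []
cascade("summarize release notes", tier1, tier2, passes, telemetry)  # tier 1 suffices
cascade("why did churn spike?", tier1, tier2, passes, telemetry)     # fails gate, escalates
print([entry["tier"] for entry in telemetry])  # [1, 2]
```

Because both the gate and the telemetry live outside the models, you can tighten the quality bar or retune tier routing without touching either prompt.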