◢ Chapter 04½ · PM Spec

Spec the seams, not the model.

Value in any agent flow doesn't live in the model — it lives at three seams: handoffs between agents (what passes, what gets lost), decision points (where the agent chooses its next action), and evaluation steps (how the system knows it got it right). These nine questions are how a PM designs at those seams. Each step also names a concrete artifact the agent should produce — so 'spec' becomes 'shippable contract'.

◢ The Intelligence Seam
When something feels off about an agent flow, it's almost never the model. It's a seam: a handoff dropped context, a decision point had no fallback, an evaluation step had no rubric. Spec the seams and the model gets to do its job.
◢ The framework

Nine questions. Answer them before you ship.

01

Trigger

What starts the flow?

The trigger defines the contract boundary — what causes the flow to begin. This is almost always underspecified in early agent designs and is where unexpected or malformed inputs enter the system.

Example

"A support ticket is created with severity ≥ P2 and no assignee within 5 minutes" — not just "user submits a ticket."

Concrete artifact

The trigger handler emits a typed `FlowStarted` event: { trigger_type, source_id, payload, timestamp, idempotency_key } — logged before any LLM call runs.
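
A minimal TypeScript sketch of that contract. The field types and the dedupe guard are illustrative assumptions; only the field names come from the spec above.

```typescript
// The typed trigger event, logged before any LLM call runs.
interface FlowStarted {
  trigger_type: "event" | "schedule" | "user_action" | "agent_output";
  source_id: string;                // e.g. the ticket that fired the trigger
  payload: Record<string, unknown>;
  timestamp: string;                // ISO 8601
  idempotency_key: string;          // same trigger => same key, used to dedupe
}

// Guard against one trigger firing multiple simultaneous flow instances.
const seenKeys = new Set<string>();

function startFlow(event: FlowStarted): boolean {
  if (seenKeys.has(event.idempotency_key)) return false; // duplicate: drop it
  seenKeys.add(event.idempotency_key);
  console.log("FlowStarted", JSON.stringify(event));     // log before any LLM call
  return true;
}
```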

Questions to ask
Is the trigger an event, a schedule, a user action, or another agent's output?
What validates the trigger before the flow starts?
Can the same trigger fire multiple simultaneous flow instances?
02

Input Schema

What structured data does it accept?

Define the shape of inputs explicitly — not just 'user message' but type, source, format, and expected range. Untyped inputs are a leading cause of silent agent failures in production.

Example

{ intent: string, context_docs: string[], user_history_days: 30, language: 'en', confidence_floor: 0.7 }

Concrete artifact

The validator agent emits a `ValidatedInput` object that conforms to the schema, OR a structured `ValidationError` with field-level reasons. The downstream agent never sees raw input.
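
One way to express that contract in TypeScript. The specific checks and defaults are illustrative assumptions; the shape mirrors the Step 02 example.

```typescript
// The validator's two possible outputs: typed input or field-level errors.
interface ValidatedInput {
  intent: string;
  context_docs: string[];
  user_history_days: number;   // default: 30
  language: string;            // default: "en"
  confidence_floor: number;    // default: 0.7
}

interface ValidationError {
  ok: false;
  errors: { field: string; reason: string }[]; // field-level reasons
}

type ValidatorResult = { ok: true; input: ValidatedInput } | ValidationError;

function validate(raw: Record<string, unknown>): ValidatorResult {
  const errors: { field: string; reason: string }[] = [];
  if (typeof raw.intent !== "string") errors.push({ field: "intent", reason: "required string" });
  if (!Array.isArray(raw.context_docs)) errors.push({ field: "context_docs", reason: "required string[]" });
  if (errors.length > 0) return { ok: false, errors };
  return {
    ok: true,
    input: {
      intent: raw.intent as string,
      context_docs: raw.context_docs as string[],
      user_history_days: (raw.user_history_days as number) ?? 30, // defaults applied here
      language: (raw.language as string) ?? "en",
      confidence_floor: (raw.confidence_floor as number) ?? 0.7,
    },
  };
}
```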

Questions to ask
What is required vs. optional — and what is the default for optional fields?
What happens with malformed or missing inputs?
Who is responsible for input validation before the agent sees it?
03

Context & Memory

What does it need to know?

Agents fail when they assume context they don't have. List every external knowledge source, memory store, or tool the agent needs — and define what happens if any are unavailable at runtime.

Example

Needs: CRM record (real-time pull), product docs (RAG index, updated weekly), user's prior session history (last 90 days).

Concrete artifact

The context-loader agent emits a `ContextBundle`: { sources_used[], freshness_per_source, missing_sources[], degraded: bool }. The main agent receives this and can reason about what it does and doesn't know.
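
A sketch of a loader that degrades instead of crashing, assuming per-source loaders that either return data or throw. The `SourceLoader` type is hypothetical.

```typescript
// The bundle the main agent receives instead of raw, assumed-present context.
interface ContextBundle {
  sources_used: string[];
  freshness_per_source: Record<string, string>; // source -> last-updated timestamp
  missing_sources: string[];
  degraded: boolean; // true if any source was unavailable at runtime
}

type SourceLoader = {
  name: string;
  load: () => Promise<{ data: string; updatedAt: string }>;
};

async function loadContext(loaders: SourceLoader[]): Promise<ContextBundle> {
  const bundle: ContextBundle = {
    sources_used: [],
    freshness_per_source: {},
    missing_sources: [],
    degraded: false,
  };
  for (const { name, load } of loaders) {
    try {
      const { updatedAt } = await load(); // the data itself goes to the agent's prompt
      bundle.sources_used.push(name);
      bundle.freshness_per_source[name] = updatedAt;
    } catch {
      bundle.missing_sources.push(name);  // graceful degradation, not a crash
      bundle.degraded = true;
    }
  }
  return bundle;
}
```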

Questions to ask
Is context pulled fresh at runtime or baked into the prompt at deploy time?
How stale can context be before output quality degrades meaningfully?
What is the graceful degradation path if a source is unavailable?
04

Output Schema

What does it guarantee to produce?

The output contract is the PM's core design decision. Define not just format but confidence levels, completeness expectations, and what a null or uncertain result looks like. Downstream agents depend on this.

Example

{ summary: string (≤200 words), action_items: string[], confidence: 0–1, sources: url[], fallback_reason: string? }

Concrete artifact

The generator agent emits structured JSON conforming to the declared schema, with `confidence`, `sources`, and `schema_version` always populated — never a free-form string.
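
The output contract as a TypeScript type, plus what a graceful failure looks like under it. The comments restate the example's constraints; the `"1.0"` version literal is an assumption.

```typescript
// The guaranteed output shape; never a free-form string.
interface AgentOutput {
  schema_version: "1.0";     // always populated; bump on breaking change
  summary: string;           // enforce the <= 200-word cap at generation time
  action_items: string[];
  confidence: number;        // 0 to 1; downstream gates read this
  sources: string[];         // URLs backing the summary
  fallback_reason?: string;  // populated only on graceful failure
}

// A graceful failure still conforms to the schema.
const fallback: AgentOutput = {
  schema_version: "1.0",
  summary: "",
  action_items: [],
  confidence: 0,
  sources: [],
  fallback_reason: "context sources unavailable",
};
```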

Questions to ask
Who or what consumes this output — a human, another agent, or a database?
What does a graceful failure output look like when the agent can't complete?
How is the output format versioned when requirements change?
05

Success Criteria

How do you know it worked?

This is the hardest part. For agent flows, success is often qualitative — which means you need an evaluation strategy before you have scale data. Define a rubric, not just a feeling.

Example

Human evaluators score ≥ 4/5 on accuracy, completeness, and tone for ≥ 90% of a 100-case test set drawn from real production inputs.

Concrete artifact

The critic agent emits a scored `EvalReport`: { per_criterion_scores, weighted_total, pass: bool, rationale } — written against a versioned rubric so results are comparable across runs.
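
A sketch of the critic's scoring step. The criterion weights and the per-criterion pass bar are assumptions drawn from the example rubric, not a prescribed method.

```typescript
// One case scored against a versioned rubric.
interface EvalReport {
  rubric_version: string;
  per_criterion_scores: Record<string, number>; // each scored 1-5
  weighted_total: number;
  pass: boolean;
  rationale: string;
}

// Assumed weights; the rubric version pins them so runs stay comparable.
const WEIGHTS = { accuracy: 0.5, completeness: 0.3, tone: 0.2 };

function score(scores: Record<keyof typeof WEIGHTS, number>, rationale: string): EvalReport {
  const weighted_total = Object.entries(WEIGHTS).reduce(
    (sum, [criterion, w]) => sum + w * scores[criterion as keyof typeof WEIGHTS],
    0,
  );
  return {
    rubric_version: "rubric-v1",
    per_criterion_scores: scores,
    weighted_total,
    pass: Object.values(scores).every((s) => s >= 4), // the >= 4/5 bar from the example
    rationale,
  };
}
```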

Questions to ask
What is your evaluation method before you have real traffic volume?
Which metrics are leading indicators of user trust — not vanity metrics?
Who owns ongoing evaluation: product, engineering, or a dedicated eval function?
06

Failure Modes & Degradation

The known ways it breaks

Every pattern has characteristic failure modes. Document them explicitly, and for each one define the detection signal and the flow's response: fallback, notify, escalate, or halt.

Example

If output confidence < 0.6, escalate to human queue instead of auto-responding. If tool call fails, retry once then return structured error.

Concrete artifact

The supervisor agent emits a `FlowOutcome`: { status: ok|degraded|failed, failure_mode?, action_taken: retry|fallback|escalate|halt, escalation_id? } — every run produces one, success or not.
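
A minimal supervisor sketch applying the confidence gate from the example. The added `"none"` action for clean runs and the escalation ID format are assumptions beyond the spec.

```typescript
// Every run ends by emitting exactly one FlowOutcome, success or not.
interface FlowOutcome {
  status: "ok" | "degraded" | "failed";
  failure_mode?: string;
  action_taken: "none" | "retry" | "fallback" | "escalate" | "halt";
  escalation_id?: string;
}

// The gate from the example: below 0.6 confidence, escalate to the human queue.
function superviseOutput(confidence: number): FlowOutcome {
  if (confidence >= 0.6) return { status: "ok", action_taken: "none" };
  return {
    status: "degraded",
    failure_mode: "low_confidence",
    action_taken: "escalate",
    escalation_id: `esc-${Date.now()}`, // routed to the human queue
  };
}
```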

Questions to ask
What is the worst-case output if the flow fails silently without detection?
Is there a human escalation path for every documented failure mode?
How does the user or downstream system know something went wrong?
07

Human Touchpoints

Where and why does a human touch this?

Be explicit about every point where human judgment enters — approval, review, correction, audit. Also document touchpoints you've deliberately removed, and why. This list is your accountability record.

Example

Human approves all financial actions > $500. All outputs are logged for async audit within 24h. Memory writes require human approval in v1.

Concrete artifact

The review-handoff agent emits a `ReviewPacket` for the human: { draft_output, context_summary, why_escalated, suggested_action, deadline } — and receives a `ReviewDecision` { approve|edit|reject, edits?, reviewer_id } back into the flow.
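
Both halves of that handoff as TypeScript types, with a hypothetical `applyDecision` helper showing how the decision re-enters the flow.

```typescript
// What the human sees: only what they need to decide well at this moment.
interface ReviewPacket {
  draft_output: string;
  context_summary: string;
  why_escalated: string;     // e.g. "financial action > $500"
  suggested_action: string;
  deadline: string;          // ISO 8601; gate touchpoints block until decided
}

// What comes back: the accountability record.
interface ReviewDecision {
  decision: "approve" | "edit" | "reject";
  edits?: string;            // present only when decision === "edit"
  reviewer_id: string;
}

function applyDecision(packet: ReviewPacket, d: ReviewDecision): string | null {
  if (d.decision === "reject") return null; // flow halts or falls back
  return d.decision === "edit" ? d.edits ?? packet.draft_output : packet.draft_output;
}
```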

Questions to ask
Is this touchpoint a gate (blocks flow) or async (review-only after the fact)?
What information does the human need at this moment to decide well?
At what traffic scale does this touchpoint become the operational bottleneck?
08

Latency & Cost Profile

How long, how much, at what scale?

Agent flows have very different cost structures from traditional software. Parallel Fan-Out with 5 agents costs roughly 5× per run; reflection loops multiply inference cost. Spec these before committing to a pattern.

Example

P50 latency: 4s. Cost per run: ~$0.08. At 10,000 runs/day = $800/day — within budget at current scale, flag if volume doubles.

Concrete artifact

The telemetry layer emits per-run `RunMetrics`: { stage_latencies, prompt_tokens, completion_tokens, tool_calls, est_cost_usd, retries } — aggregated into the dashboard powering Step 5's success rubric.
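
The per-run record as a type, plus the cost-per-correct-output rollup that this step's questions ask for. Joining on Step 5's `pass` flag is an assumption about how the dashboard combines the data.

```typescript
// Emitted once per run by the telemetry layer.
interface RunMetrics {
  stage_latencies: Record<string, number>; // ms per stage
  prompt_tokens: number;
  completion_tokens: number;
  tool_calls: number;
  est_cost_usd: number;
  retries: number;
}

// Cost-per-correct-output, not just cost-per-run.
function costPerCorrectOutput(runs: { metrics: RunMetrics; pass: boolean }[]): number {
  const totalCost = runs.reduce((sum, r) => sum + r.metrics.est_cost_usd, 0);
  const correct = runs.filter((r) => r.pass).length;
  return correct === 0 ? Infinity : totalCost / correct;
}
```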

Questions to ask
Which pattern choices are driving the highest cost per successful output?
What is the cost-per-correct-output — not just cost-per-run?
Where is the latency bottleneck: model inference, tool calls, or orchestration overhead?
09

Composability Notes

What does this plug into?

Design agent flows to be composable from day one. Document which pattern this flow uses internally, what it can be nested inside, and which downstream flows can consume it as a worker or sub-agent.

Example

This flow implements Orchestrator-Worker internally. It exposes a clean JSON interface and can be used as a worker in a higher-level Pipeline or called by an Orchestrator.

Concrete artifact

The flow registry publishes a `FlowManifest`: { flow_id, version, input_schema, output_schema, side_effects, required_secrets, sla } — the contract any other flow uses to call this one.
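
The manifest as a TypeScript type, with a hypothetical compatibility check a caller might run before nesting this flow as a worker. The semver convention is an assumption.

```typescript
// The published contract another flow reads before calling this one.
interface FlowManifest {
  flow_id: string;
  version: string;        // semver; bump major on interface change
  input_schema: object;   // e.g. a JSON Schema for the validated input
  output_schema: object;  // e.g. a JSON Schema for the agent output
  side_effects: string[]; // writes, notifications, external calls
  required_secrets: string[]; // names only, never values
  sla: { p50_latency_ms: number; max_cost_usd: number };
}

// A caller checks version compatibility before composing this flow.
function isCompatible(manifest: FlowManifest, requiredMajor: number): boolean {
  return parseInt(manifest.version.split(".")[0], 10) === requiredMajor;
}
```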

Questions to ask
Does this flow expose a clean enough interface to be reused across contexts?
What assumptions does it make about its execution environment?
How would you version this flow if its interface needs to change?