Building the Brain Behind AI Coaching

Ever tried to get an AI to stick to a script? Yeah, me too. 🤦‍♂️

When we set out to build an AI coaching product, I thought the hard part would be making it sound human. Turns out, the real challenge was getting it to follow instructions while also sounding human. Who knew?

The Problem: An AI With Three Personalities

Here’s what we needed to build:

Knowledge Assessment Mode: The AI needed to be a strict examiner—ask specific questions from uploaded materials, check answers against facts, and never, ever make stuff up.

Skills Practice Mode: The AI needed to be a supportive trainer—improvise naturally, push users with follow-ups, and know when practice goals were met.

Guided Prompting Mode: The AI needed to follow a blueprint while adapting to conversation flow—structured enough to hit key points, flexible enough to feel natural.

Oh, and these three modes needed to live in the same system without stepping on each other’s toes. No pressure.

What Everyone Gets Wrong About AI Coaching

When you tell people you’re building an AI coach, they assume it’s easy. “Just throw it at GPT-4 and you’re done, right?” Wrong. Here are the myths we had to bust:

“One Prompt Can Do Everything”: Nope. Trying to cram assessment rules AND roleplay personality AND guided conversation flow into a single prompt is like asking someone to be a drill sergeant, therapist, and improv actor simultaneously.

“The AI Will Just Know What to Do”: The model doesn’t magically understand your assessment structure or conversational blueprints. Without explicit control, it skips questions, hallucinates facts, and generally does whatever it wants.

“JSON Output Is Reliable”: Ha! The number of times we got malformed JSON or creative interpretations of our schema would make you cry.

“Unclear User Answers Will Sort Themselves Out”: When a user gives a vague response, the AI needs a strategy, not permission to improvise endlessly.

“Flexibility and Control Are Mutually Exclusive”: This was the big one. We thought we had to choose between rigid scripts and natural conversation. Turns out, you can have both with the right architecture.

Our First Attempt (AKA: The Disaster)

We did what everyone does first: threw everything at a single LLM instance and hoped for the best.

The setup was simple:

  • One big prompt with the assessment script, evaluation criteria, and conversation guidelines all mixed together
  • Ask the model to self-report what questions it asked
  • Use some janky parsing to extract answers from its output
  • Cross fingers and ship it

It was a beautiful disaster.

The AI invented facts. It skipped questions. It asked random follow-ups that led nowhere. When we asked it to evaluate itself, it was about as reliable as asking a student to grade their own test. And forget about natural conversation flow—it either sounded like a robot reading from a script or went completely off-script.

We ran simulations. Only 62% of assessments actually followed the script. Nearly a third failed because the AI just… forgot to ask certain questions. Another 10% failed because it confidently stated “facts” that didn’t exist in the uploaded documents.

The guided conversations weren’t any better. The AI would either stick too rigidly to templates (feeling robotic) or wander off into conversational tangents that never accomplished the training goals.

We needed a new approach. Badly.

The Breakthrough: Stop Trusting the AI

The key insight hit us during a particularly frustrating debugging session: We were giving the AI too much power.

Think about it—when you train a human coach, you don’t just hand them a manual and say “figure it out.” You give them a structured program, checkpoints, rubrics, and supervision. You also give them flexibility within boundaries. Why were we trusting an AI to do more than we’d trust a human?

So we flipped the script entirely: The code would be the boss. The AI would be the worker.

The New Mental Model

Instead of one monolithic AI brain trying to juggle everything, we built three specialized components working together:

The Dialogue Graph Engine: This is the script—an actual graph structure that represents every question, every possible answer path, every decision point, and every conversational blueprint. It lives in our code, not in a prompt.

The LLM Task Runner: The AI gets narrow, specific jobs—“extract an answer in this exact format,” “ask this clarifying question,” or “generate a response that hits these conversational beats.” That’s it. No freelancing.

The Evaluation Engine: Scoring happens in code using explicit rules. No more asking the AI to judge itself.

This separation was everything. Suddenly, we had control and flexibility.
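
Here’s roughly what that separation looks like in code. This is a minimal sketch, and the names (DialogueGraphEngine, LLMTaskRunner, EvaluationEngine, and the node fields they touch) are illustrative rather than our actual classes:

    from typing import Callable

    class DialogueGraphEngine:
        """Owns the script: every node, every transition, and the current position."""

        def __init__(self, graph: dict, start_node_id: str):
            self.graph = graph              # node_id -> node object
            self.current_id = start_node_id

        def current_node(self):
            return self.graph[self.current_id]

        def advance(self, answer: dict) -> None:
            # Routing decisions live here, in code, never in a prompt.
            self.current_id = self.graph[self.current_id].next_node(answer)

    class LLMTaskRunner:
        """Runs one narrow, scoped task per node -- no freelancing."""

        def __init__(self, llm: Callable[[str, int], str]):
            self.llm = llm                  # placeholder for whatever model client you use

        def run(self, node, prompt: str) -> str:
            return self.llm(prompt, node.token_limit)

    class EvaluationEngine:
        """Scores answers with explicit rules in code -- never by asking the model."""

        def evaluate(self, node, answer: dict) -> bool:
            return all(rule(answer) for rule in node.validation_rules)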

How It Actually Works

Let me walk you through what happens when a user interacts with the system now:

The Dialogue Graph: Your Source of Truth

Every assessment is a graph. Each node represents a specific moment in the conversation with:

  • The exact prompt template
  • The expected answer format (strict JSON schema for assessments, flexible for practice)
  • Validation rules (like “year must be between 1900 and 2025”)
  • Node type flags: strict (Knowledge Assessment), flexible (Skills Practice), or blueprint (Guided Prompting)
  • What happens next based on the answer and conversation flow

When a user starts, we’re at node 1. They answer, we validate, we move to the next node. It’s deterministic. Repeatable. Auditable. But it’s also smart enough to adapt when needed.
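
A node boils down to plain data plus a routing function. Here’s a simplified sketch; the field names and the example node are illustrative, not our production schema:

    from dataclasses import dataclass, field
    from typing import Callable, Optional

    @dataclass
    class DialogueNode:
        """Simplified node definition -- field names are illustrative."""
        node_id: str
        node_type: str                        # "strict", "flexible", or "blueprint"
        prompt_template: str                  # the exact prompt sent to the LLM
        token_limit: int = 200                # scoped output budget for this node
        answer_schema: Optional[dict] = None  # strict JSON schema for assessment nodes
        validation_rules: list = field(default_factory=list)
        next_node: Callable[[dict], str] = lambda answer: "end"   # routing lives in code

    # Example: the graduation-year question from a Knowledge Assessment graph.
    graduation_node = DialogueNode(
        node_id="q_graduation_year",
        node_type="strict",
        prompt_template='Extract the graduation year as JSON: {"year": <int>}',
        answer_schema={"type": "object", "required": ["year"]},
        validation_rules=[lambda a: 1900 <= a["year"] <= 2025],
        next_node=lambda answer: "q_degree_subject",
    )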

The LLM’s Actual Job: Scoped and Focused

When we hit a node, the LLM gets a super focused task that varies by mode:

For Knowledge Assessment nodes:
“Here’s the question. Here are relevant excerpts from the uploaded documents. Extract the answer in this exact JSON format. Nothing else.”

For Skills Practice nodes:
“You’re a supportive trainer. The user is practicing negotiation. Respond naturally, push them with a follow-up that challenges their approach. Report back which training objectives you covered in this hidden structure.”

For Guided Prompting nodes:
“Follow this conversational blueprint. You need to cover these three key points, but adapt your phrasing to the user’s communication style. Emit blueprint tokens showing which beats you’ve hit.”

We set appropriate token limits for each mode. We validate outputs immediately. If something doesn’t match expectations? We have specific recovery strategies for each mode.
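
The dispatch per node type looks roughly like this. It’s a sketch: the llm and validate_output helpers are stand-ins for your own client and checks, and the token limits are placeholder numbers, not our tuned values:

    # Rough dispatch of a scoped task per node type.
    TOKEN_LIMITS = {"strict": 150, "flexible": 400, "blueprint": 400}   # placeholders

    def run_node_task(node, user_message, context, llm):
        if node.node_type == "strict":
            prompt = (
                f"{node.prompt_template}\n\n"
                f"Relevant document excerpts:\n{context['excerpts']}\n\n"
                "Return ONLY JSON matching the schema. Nothing else."
            )
        elif node.node_type == "flexible":
            prompt = (
                f"{node.prompt_template}\n\nUser said: {user_message}\n"
                "Respond naturally, then append your hidden progress metadata."
            )
        else:  # blueprint
            prompt = (
                f"{node.prompt_template}\n\nUser said: {user_message}\n"
                "Cover the required beats and emit blueprint tokens for each one you hit."
            )
        raw = llm(prompt, TOKEN_LIMITS[node.node_type])
        return validate_output(node, raw)   # outputs are checked immediately, in code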

The Grounding Check: Trust But Verify

After the LLM extracts an answer in assessment mode, we don’t just trust it. We:

  1. Convert the answer to an embedding vector
  2. Compare it against the actual uploaded document chunks
  3. Calculate similarity

If the similarity score is too low, we know the AI might be making stuff up. That triggers our clarification system.

For guided prompting and skills practice, we use similar techniques but with looser thresholds—we want to catch hallucinations while allowing natural paraphrasing.
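
Here’s a minimal sketch of the grounding check, assuming an embed() function from whatever embedding model you’re using; the threshold values are placeholders, not our tuned numbers:

    import math

    # Mode-specific thresholds -- placeholder values; the real ones are tuned empirically.
    GROUNDING_THRESHOLDS = {"strict": 0.80, "flexible": 0.60, "blueprint": 0.60}

    def cosine_similarity(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def is_grounded(answer_text, document_chunks, node_type, embed):
        """True if the extracted answer is supported by at least one source chunk.

        `embed` is whatever text-to-vector function your embedding model exposes.
        """
        if not document_chunks:
            return False
        answer_vec = embed(answer_text)
        best = max(cosine_similarity(answer_vec, embed(chunk)) for chunk in document_chunks)
        return best >= GROUNDING_THRESHOLDS[node_type]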

The Clarification Cascade: Simple Rules, Better Results

When an answer is unclear, ambiguous, or potentially hallucinated, we have a policy:

  1. Ask a templated clarifying question targeting exactly what’s missing
  2. If that doesn’t work, ask one more time differently
  3. Still unclear? Send it to a human reviewer

This simple two-attempt rule prevents endless back-and-forth while still giving users a fair chance to clarify.
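
The whole policy fits in a few lines. This is a sketch: ask_user, escalate_to_reviewer, and the clarification_templates field are placeholders for your own I/O, ops tooling, and node authoring:

    MAX_CLARIFICATION_ATTEMPTS = 2   # two tries, then a human takes over

    def resolve_unclear_answer(node, first_answer, is_clear, ask_user, escalate_to_reviewer):
        """Sketch of the clarification cascade; the callbacks are placeholders.

        `node.clarification_templates` is assumed to hold one phrasing per attempt,
        each targeting exactly what's missing.
        """
        answer = first_answer
        for attempt in range(MAX_CLARIFICATION_ATTEMPTS):
            if is_clear(node, answer):
                return answer
            answer = ask_user(node.clarification_templates[attempt])
        if is_clear(node, answer):
            return answer
        return escalate_to_reviewer(node, answer)   # still unclear: human review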

The Three Modes In Action

Here’s how each mode uses the same infrastructure differently:

Knowledge Assessment: Maximum Control

In assessment mode, nodes are marked as “strict.” The AI has zero freedom to deviate. Every question must be asked. Every answer must be validated against uploaded content. The graph traversal is completely deterministic.

Example flow:

  • Node: “What year did the user graduate?”
  • AI extracts: {"year": 2015}
  • System validates: year is in valid range, matches document
  • System moves to next node

If validation fails, the clarification cascade kicks in. If that fails, human review. No guessing allowed.
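
Tying it together, one strict-node turn might look like this. It’s a sketch that reuses the hypothetical llm, embed, and is_grounded helpers from earlier:

    import json

    def run_strict_node(node, user_message, context, llm, embed):
        """One deterministic assessment step (sketch); helper names are illustrative."""
        raw = llm(f"{node.prompt_template}\n\nUser said: {user_message}", node.token_limit)
        try:
            answer = json.loads(raw)                     # malformed JSON fails fast
        except json.JSONDecodeError:
            return "clarify"
        try:
            valid = all(rule(answer) for rule in node.validation_rules)
        except (KeyError, TypeError):
            valid = False
        if not valid:
            return "clarify"                             # e.g. year outside 1900-2025
        if not is_grounded(json.dumps(answer), context["excerpts"], "strict", embed):
            return "clarify"                             # possible hallucination
        return node.next_node(answer)                    # all checks passed: advance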

Skills Practice: Controlled Freedom

In practice mode, nodes are marked as “flexible.” The AI gets to improvise responses, but it still has to report back what it accomplished through hidden metadata.

Example flow:

  • Node: “Practice handling objections”
  • AI generates natural response to user’s objection
  • AI emits hidden tokens: {"covered": ["empathy", "reframing"], "user_quality": "defensive"}
  • System checks: required objectives covered? Yes. Move to next practice scenario.

The graph still controls the flow, but the AI has breathing room to be conversational.
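
One way to implement those hidden tokens is to have the model append machine-readable metadata after a delimiter that never reaches the user. A sketch, where the <meta> convention and the required_objectives field are assumptions for illustration:

    import json

    META_DELIMITER = "<meta>"   # assumed convention: visible reply first, hidden JSON after

    def split_reply_and_metadata(raw_output):
        """Separate what the user sees from what the system tracks (sketch)."""
        if META_DELIMITER not in raw_output:
            return raw_output.strip(), {}
        visible, meta_text = raw_output.split(META_DELIMITER, 1)
        try:
            metadata = json.loads(meta_text)
        except json.JSONDecodeError:
            metadata = {}
        return visible.strip(), metadata

    def objectives_met(node, metadata):
        """Did the improvised response still cover the node's required objectives?"""
        return set(node.required_objectives) <= set(metadata.get("covered", []))

    # With the example output above:
    reply, meta = split_reply_and_metadata(
        "I hear that -- the price does feel steep. What would make it feel fair to you?"
        '<meta>{"covered": ["empathy", "reframing"], "user_quality": "defensive"}'
    )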

Guided Prompting: The Best of Both Worlds

This is where it gets really interesting. Guided prompting is the hybrid mode—the sweet spot between rigid scripts and total freedom.

Blueprint Nodes: We created nodes with “soft schema enforcement.” The AI can generate natural, free-flowing dialogue, but behind the scenes it emits “blueprint tokens” that tell our system what conversational goals it achieved.

Think of it like improv theater with rules. The actor (AI) has freedom to improvise their lines, but they have to hit certain beats and advance the scene toward specific goals.

Example flow:

  • Blueprint node: “Discuss project timeline and identify blockers”
  • AI has natural conversation about the project
  • AI emits hidden tokens: {"discussed_timeline": true, "identified_blockers": ["resource shortage", "unclear requirements"], "user_clarity": "high"}
  • System checks: Did we cover the required talking points? Yes. Were there signs of confusion? No. Move forward.

The user experiences natural conversation. Our system tracks whether the AI is actually accomplishing the conversational objectives.
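
The system-side check is small: compare the blueprint tokens against the beats the node requires, then either advance or steer the next turn. A sketch with hypothetical field names:

    def check_blueprint_progress(node, blueprint_tokens):
        """Decide whether a blueprint node's conversational goals are met (sketch).

        `node.required_beats` is a hypothetical field, e.g.
        ["discussed_timeline", "identified_blockers"].
        """
        missing = [
            beat for beat in node.required_beats
            if not blueprint_tokens.get(beat)        # False, empty list, or absent
        ]
        confused = blueprint_tokens.get("user_clarity") == "low"
        if not missing and not confused:
            return {"action": "advance"}
        # Stay on this node and steer the next turn toward whatever is still missing.
        return {"action": "continue", "focus_on": missing, "slow_down": confused}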

Scenario Injection: We attach a “scenario object” to each session—a JSON bundle containing:

  • Key constraints the AI must respect
  • Roleplay setup and context
  • Communication style guidelines
  • Current user state (confident, hesitant, confused)

The AI’s system prompt gets enhanced with these constraints while still obeying node-level rules. So if you’re guiding someone through a difficult customer conversation, the scenario might say “Customer is frustrated about billing, be empathetic but firm, cover these three points before moving on.”
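
Concretely, the scenario object is just data that gets folded into the prompt. A simplified sketch; the fields and wording are illustrative:

    # Illustrative scenario object for the frustrated-customer example.
    scenario = {
        "constraints": [
            "Customer is frustrated about billing",
            "Be empathetic but firm",
            "Cover all three billing talking points before moving on",
        ],
        "roleplay_setup": "Support call about an unexpected charge on the latest invoice.",
        "style": "Warm, concise, no jargon.",
        "user_state": "hesitant",
    }

    def build_system_prompt(base_prompt, scenario, node):
        """Fold session-level scenario constraints into the node-level task (sketch)."""
        constraints = "\n".join(f"- {c}" for c in scenario["constraints"])
        return (
            f"{base_prompt}\n\n"
            f"Scenario: {scenario['roleplay_setup']}\n"
            f"Style: {scenario['style']}\n"
            f"Current user state: {scenario['user_state']}\n"
            f"Constraints you must respect:\n{constraints}\n\n"
            f"Node instructions (these win if anything conflicts):\n{node.prompt_template}"
        )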

Flow Adaptation: The really clever bit? We added a small local classifier that detects communication markers in real-time:

  • Is the user hesitant?
  • Are they being verbose or terse?
  • Are they showing clarity or confusion?
  • What’s their emotional state?

Based on these signals, the blueprint adapts. If a user is highly hesitant, the system shifts to more supportive follow-ups. If they’re rushing through without understanding, it slows them down with reflection prompts. And if they’re crystal clear and moving fast, it accelerates the conversation.
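
In production this is a small trained classifier; the heuristic stand-in below just shows the shape of the adaptation logic (the word list, thresholds, and style labels are made up for illustration):

    HEDGE_WORDS = {"maybe", "um", "uh", "guess", "perhaps", "possibly"}   # made-up list

    def detect_markers(user_message):
        """Heuristic stand-in for the real (trained) classifier."""
        words = user_message.lower().split()
        return {
            "hesitant": sum(w in HEDGE_WORDS for w in words) >= 2,
            "terse": len(words) < 6,
            "verbose": len(words) > 80,
        }

    def adapt_followup(markers):
        """Map communication markers to a follow-up style (labels are illustrative)."""
        if markers["hesitant"]:
            return "supportive"    # gentler, more encouraging follow-ups
        if markers["terse"]:
            return "reflective"    # slow down, ask the user to explain their thinking
        return "accelerate"        # clear and moving fast: keep the pace up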

This hybrid approach—structure where we need it, flexibility where it matters—lets us use the same infrastructure for rigid Knowledge Assessment, natural Skills Practice, and adaptive Guided Prompting.

What Changed

The difference was night and day.

Assessment compliance jumped from 62% to 91%. The AI actually follows the script now.

Hallucinations dropped from 10% of sessions to under 2%. Grounding checks work.

Speed improved by 30%. Focused prompts with token limits are faster than rambling responses.

Guided conversations actually guide now. Users report that blueprint-driven conversations feel natural while still accomplishing clear objectives. The AI doesn’t wander off or skip important points.

But beyond the numbers, the system became explainable. When a user asks “why did I fail?” or “what should I focus on?”, we can show them exactly which question they missed, which conversational objectives they haven’t covered, and why. That transparency builds trust.

The Tradeoffs Nobody Talks About

Was it worth it? Yes. Was it free? Hell no.

We traded simplicity for control. Our system is way more complex now. We maintain graph schemas, validation rules, blueprint definitions, and an entire orchestration engine. Every new assessment or guided conversation requires careful authoring.

We traded prompt flexibility for determinism. Super strict schemas can feel rigid. We needed soft nodes, blueprint modes, and adaptive flow to add breathing room. Content creators need training to author good blueprints.

API calls increased. More structured nodes mean more individual LLM calls. Our inference costs went up about 25% per session. We offset some of this by caching common prompts and using smaller models for non-critical tasks.

Thresholds need tuning. The grounding threshold, clarification triggers, and flow adaptation signals all needed empirical calibration. They’ll drift over time as models and data change. We needed continuous monitoring and A/B testing infrastructure.

But here’s the thing: the product actually works now. Users trust it. It scales. We can audit every decision. Assessment is rigorous. Practice feels natural. Guided conversations actually accomplish goals. That’s worth the complexity.

What We Learned (So You Don’t Have To)

If you’re building anything that needs AI to be both controlled and natural:

Separate orchestration from inference. Put control in your code, not in prompts. The AI is a worker, not a manager. This is the foundation everything else builds on.

Design node contracts first. Every AI task should have a clear input format, output schema, and validation rules. Define whether a node is strict, flexible, or blueprint-driven. Ambiguity is the enemy.

Ground your answers when facts matter. If you’re working with user-uploaded content, use embeddings to check that AI responses actually match the source material. Set thresholds empirically and monitor them.

Keep clarification simple. Two attempts at clarification, then escalate. Don’t let conversations spiral. This works across all three modes.

Build hybrid modes from the start. You’ll need strict, flexible, and blueprint-driven behaviors. Don’t bolt on flexibility later—design your system to support all three from day one.

Invest in authoring tools. Build UIs for creating graphs, visualizing flows, defining blueprints, and testing scenarios. You’ll spend more time creating content than you think. Make it easy.

Instrument everything. Track node success/failure, clarification rates, blueprint token coverage, human escalations, and grounding scores. Build replay capabilities. You can’t improve what you can’t measure.

Human-in-the-loop is a feature, not a failure. A small percentage of sessions needing human review is totally acceptable. Build the operator UI from the start.

The Real Win

We built an AI that knows when to follow the script exactly, when to improvise naturally, and when to guide a conversation toward goals while adapting to the human in front of it.

It’s deterministic when it needs to be, natural when it should be, adaptive when it matters, and honest when it doesn’t know something.

That’s not just good engineering—it’s what makes AI coaching actually useful instead of just impressive.

The secret? Stop trying to make one AI do everything. Build a system where code handles structure, the AI handles inference, and the boundaries between modes are explicit and intentional.

Ready to build AI systems that actually follow instructions while staying human? Start by trusting your code more than your prompts, and design for multiple modes from day one.

Your future self will thank you. 🚀