CX Meets AI: Engineering Call Intelligence That Actually Listens

The Business Challenge: When Support Calls Are a Black Box

Before we built our CX dashboard, support calls were essentially invisible to operations and product teams. Companies were sitting on thousands of customer conversations every week — conversations carrying critical signals about product issues, service gaps, sales opportunities, and operational failures — but had no systematic way to extract insights. The cost of this blindness was real:

- Operations couldn’t pinpoint root causes of repeat calls
- Service leaders had no visibility into whether agents were resolving issues effectively or escalating unnecessarily
- Sales opportunities mentioned in support calls were completely invisible, leading to missed revenue
- Without data, process improvements were based on gut feel rather than evidence

Why this matters: for companies handling thousands of support calls, even small improvements compound dramatically. A 5% increase in first-call resolution saves agent hours and improves customer satisfaction. Identifying just 10% more sales opportunities in existing calls can represent significant revenue. Catching product issues earlier reduces churn and development waste.

The market needed a solution that could actually listen to calls at scale — not just track call volumes, but understand intent, sentiment shifts, product mentions, and operational patterns — and surface this intelligence in a way that operations, product, and sales teams could act on immediately. That’s the problem we set out to solve. Here’s how we built it.

The Problem We Faced

We were building a customer experience (CX) dashboard to give operators and product teams clear visibility into what happens on support calls. The dashboard had four main sections — Product, Service Quality, Sales Opportunity, and Operations — and needed to surface everything from sentiment trends (initial/mid/final), first-call resolution (FCR), and effective-communication KPIs to operational call-reason breakdowns (warehouse fulfillment failures, wrong-item complaints) and escalation trails. The hard constraint: the dashboard had to actually “listen” — not just show volumes — and be dynamic enough to meet wildly different customer needs without rebuilding the whole stack for every account.

At project start we had raw call recordings, partial metadata (agent ID, time-on-call), and a set of business questions from stakeholders. We needed:

- accurate transcription and diarization
- robust intent and reason classification
- sentiment over call segments
- extraction of product mentions and feature requests
- sales-opportunity detection
- per-call resolution and escalation tracking

— all updated frequently enough to inform operations.

Our First Attempt

Our initial architecture was straightforward and familiar: stream recordings into S3, run a single ASR (automatic speech recognition) model across everything, push transcripts into a classic NLP pipeline (heuristic regex + keyword lists + a light classifier), and layer metrics into a single monolithic BI dashboard (React + Superset). We used Amazon Transcribe (standard) for ASR, a simple speaker-turn heuristic for diarization, VADER for sentiment, and a logistic-regression classifier trained on 3k labeled call excerpts for reasons (fulfillment, wrong-item, billing, etc.). We shipped a V1 dashboard to a pilot customer within eight weeks. That rollout taught us a lot quickly:

- Transcription quality varied wildly (WER 18–35%) depending on line noise, accents, and domain phrases — which cascaded into downstream NLP errors.
- Our single-model approach missed intent nuances: feature requests vs. complaints vs. comparisons were often conflated.
- Sentiment aggregated over whole calls hid critical dynamics: an angry start with a calm resolution still showed “neutral” overall.
- The dashboard’s static metric set didn’t align with some customers’ KPIs — one wanted detailed escalation timelines, another wanted product feature mentions grouped differently.

We tracked core metrics: end-to-end processing latency was ~45 minutes per call (batch-only), reason-classifier precision ~0.77, recall ~0.69, and sentiment accuracy ~0.72 against hand-labeled samples. Those numbers were neither stable nor sufficient for operational trust.

Why It Failed

We learned why the naive stack failed in production:

- Error amplification: ASR errors (high WER) directly reduced classification and named-entity extraction accuracy. A single bad transcription could flip a call’s reason tag.
- One-size-fits-all models: domain variance (different product names, jargon, call scripts) meant a global model underfit most customers and overfit the pilots.
- Temporal blindness: aggregating sentiment once per call missed transitions (initial frustration → mid-call calming → final satisfaction). KPIs like “effective communication” require segment-level signals.
- Static dashboarding: the monolithic dashboard had hardcoded metric definitions and required engineering work to add any new view. Customers wanted dynamic breakouts, e.g., seeing “wrong-item” split by warehouse ID or by SKU family — not possible without a rebuild.
- Trust gap: operations needed auditable evidence (timestamps, utterance text, escalation points). Our pipeline didn’t carry provenance metadata end-to-end.

We could have iterated on the original system ad infinitum, but that would have chased symptoms. We needed architectural changes that reduced upstream fragility, enabled per-customer specialization, and provided explainability.

The Breakthrough

We reframed the problem: rather than a single pipeline that outputs “answers,” we would build a modular call-intelligence platform that produces trusted, auditable artifacts (segment-level transcripts, time-aligned sentiment and intent labels, entity records, and embeddings) and a flexible dashboard layer that composes views from these artifacts by configuration. The key design pillars became:

- Modular audio processing with fallbacks
- Segment-aware NLP and time-series sentiment
- Per-customer configuration and on-demand specialization
- Explainability and provenance at every step
- Operational SLAs for latency and accuracy

Below we describe the architecture, implementation choices, and how we operationalized trust.

Implementation Details

Models and training:

- Labeled dataset: 40k calls (>2M utterances) aggregated from pilots (consented), stratified by product line and geography. Holdout test set: 5k calls.
- Intent/reason model: DeBERTa-v3-small fine-tuned (Hugging Face Transformers v4.30), initial learning rate 2e-5, batch size 32, trained for 4 epochs. Precision/recall on the test set: 0.86 / 0.82 (F1 0.84).
- Sentiment model: a BERT-based classifier for 3-way sentiment, combined with a regression score and a rule engine to identify sentiment flips; 3-window accuracy = 0.88 (see the sketch after this list).
- ASR: we tuned acoustic adaptation on common product names and used the Lexicon/Custom Vocabulary features in Amazon Transcribe for customer-specific terms. For persistently low-confidence segments (<0.6), a Whisper-large-v2 fallback reduced WER by ~6 percentage points.
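To make the segment-level idea concrete, here is a minimal sketch of three-window sentiment with flip detection. It uses a generic Hugging Face pipeline as a stand-in for our fine-tuned 3-way model; the function names and window logic are illustrative, not our production code.

```python
# Minimal sketch: initial/mid/final sentiment windows with flip detection.
# The default pipeline model is a stand-in for the fine-tuned 3-way classifier.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # stand-in model

def window_labels(utterances, n_windows=3):
    """Split a call's utterances into initial/mid/final windows and label each."""
    size = max(1, len(utterances) // n_windows)
    windows = [utterances[i:i + size] for i in range(0, len(utterances), size)]
    return [classifier(" ".join(w)[:2000])[0]["label"] for w in windows[:n_windows]]

def has_flip(labels):
    """A 'flip' is any change between consecutive windows (e.g. NEGATIVE -> POSITIVE)."""
    return any(a != b for a, b in zip(labels, labels[1:]))
```

Per-call KPIs like “effective communication” then become functions over these window labels rather than a single whole-call score.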
Per-customer customization:

- Config model: each customer has a JSON schema describing product taxonomy, escalation tags, radar metrics, and dashboard templates. These configs drive NER dictionaries, custom vocabularies, and dashboard breakouts. (A minimal sketch follows below.)
- Feature toggles:
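For illustration, here is roughly what such a per-customer config might look like, shown as a pydantic model rather than raw JSON Schema. Field names are hypothetical, not the production schema.

```python
# Hypothetical sketch of a per-customer config; field names are illustrative.
from typing import Dict, List
from pydantic import BaseModel

class DashboardView(BaseModel):
    title: str
    metric: str          # e.g. "wrong_item_rate"
    group_by: List[str]  # e.g. ["warehouse_id", "sku_family"]

class CustomerConfig(BaseModel):
    customer_id: str
    product_taxonomy: Dict[str, List[str]]  # product line -> product names / jargon
    escalation_tags: List[str]
    custom_vocabulary: List[str]            # fed to the ASR custom-vocabulary feature
    views: List[DashboardView]

config = CustomerConfig(
    customer_id="acme",
    product_taxonomy={"fulfillment": ["wrong item", "late delivery"]},
    escalation_tags=["supervisor_requested", "refund_demanded"],
    custom_vocabulary=["AcmeShip", "SKU-Express"],
    views=[DashboardView(title="Wrong-item by warehouse",
                         metric="wrong_item_rate",
                         group_by=["warehouse_id"])],
)
```

Because views are data, adding a new breakout like “wrong-item by SKU family” becomes a config change instead of an engineering ticket.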

Every CEO Wants AI-Driven Growth. Most Are Looking in the Wrong Place

I spend a lot of time meeting with CEOs across industries, and every single one of them is thinking about AI. Some are already adopting AI tools. Others are deep in evaluations, building business cases, running pilots. A few are still in the exploration phase, trying to separate hype from reality. But they all share the same goal: use AI to drive growth in 2026. What many of them miss, however, is that the biggest unlock for AI isn’t the tools themselves but the data and context you feed them.

The Expensive Detour

The default playbook goes something like this: buy the latest AI tool, implement the newest model, chase what your competitors are doing. New sales AI. New customer service automation. New analytics platform. Another AI layer on top of your existing stack. Each promises transformation. Each costs five, six, or seven figures. Each takes months to implement. And most deliver incremental improvements at best. Why? Because you’re trying to build on empty ground.

The Answer Is Already in Your Business

Most companies are already sitting on the raw material for AI-driven growth. It’s in the thousands of customer conversations happening every week across the business. Sales calls. Support tickets. Implementation check-ins. Success reviews. Onboarding sessions. These conversations contain insights waiting to be unlocked:

- Why customers aren’t buying
- Where reps excel or struggle
- What messaging actually resonates
- What products are in demand
- Which objections kill momentum
- What is changing in the market

The insights that could transform your revenue trajectory are already there. You’re just not extracting them.

The Hidden Cost of Siloed Insights

Even when companies analyze conversations, most never capture their full value. The problem isn’t a lack of insight — it’s that insights are trapped, siloed, and disconnected from the people who can act on them. It’s a problem we’re solving at Insight7.io, and the result appears everywhere:

- Leadership sees fragments, not patterns: decisions on go-to-market, product development, and training rely on anecdotes, not reality.
- Managers can’t scale coaching: feedback stays generic because they lack the tools and context to develop each rep based on real performance.
- Reps don’t improve: delayed, vague feedback disconnected from actual conversations keeps performance flat.

When insights are siloed like this, AI tools alone won’t move the needle. Data is only valuable when it flows to the people who can act on it.

The Unlock: Conversation Intelligence Across Your Business

The companies that succeed with AI in 2026 won’t be chasing the newest AI model. They will find ways to systematically unlock the intelligence buried in their customer conversations. These companies will:

- Evaluate 100% of customer interactions — not just a sample
- Surface patterns across every touchpoint (sales, support, success, implementation)
- Generate personalized coaching at scale so reps actually improve
- Flow insights automatically to leadership for strategic decisions

This isn’t about replacing human judgment. It’s about giving your people the intelligence layer they need to perform at their best.

The Question for 2026

Instead of asking, “How do we use AI to grow?” the better question is, “Are we using the data we already have?” Before you buy another AI tool, ask yourself:

- What percentage of our customer conversations are we actually learning from?
- Do insights from those conversations reach the people who can act on them?
- Can we turn conversation data into systematic coaching and strategic intelligence?

If the answer is no, you’re not ready for more AI tools. You’re ready for conversation intelligence. The companies that figure out conversation intelligence won’t just win in 2026 — they’ll build an advantage that’s impossible to copy. If you’re interested in unlocking your customer conversation data, we’re solving this at Insight7.io. Reach out.

Automated Call Transcript Summarization: Achieving Precision with Configurable Templates

The Problem

Teams came to us for speed. They had call transcripts and needed a fast way to extract what mattered — a quick TL;DR they could act on. Our summarization service delivered that, and customers relied on it heavily. But as usage grew, the same request kept coming up: “Can we control the format?” Instead of a generic summary, customers wanted outputs that matched how they already worked — an email follow-up ready to send, an executive one-pager for leadership, or a checklist with prioritized action items. They weren’t asking for more text. They were asking for predictable structure. What they needed were summaries that came back in the exact format they specified, every time.

Why does this matter? Customers needed to feed summaries into downstream systems like CRMs and ticketing platforms. When field names changed or required sections were missing, those integrations broke. Customers couldn’t build reliable automations on top of unpredictable outputs. Before we solved this, enterprise customers were manually editing generated summaries to fix formatting issues, wasting time on work that should have been automated. Legal and compliance teams couldn’t rely on summaries when format consistency wasn’t guaranteed.

What’s the benefit of solving it? After implementing our solution, we achieved 92% structural adherence: summaries now reliably match customer templates. The business impact was significant:

- 75% reduction in manual edits: enterprise customers stopped spending time reformatting AI outputs
- Reliable automation: customers could now build downstream automations relying on consistent field names and types
- Faster enterprise adoption: customers who needed CRM and ticketing-system integration adopted the feature quickly
- Increased trust: legal and compliance teams gained confidence from audit logs and consistent formatting

The difference between 62% and 92% structural adherence was the difference between summaries that required constant human cleanup and summaries that could power business-critical workflows.

Our First Attempt

Our initial implementation was minimal: accept a free-form template string from users, append it as an instruction to the summarization prompt, and call a single large model (OpenAI GPT-4) with the transcript context. The pipeline looked like:

1. Transcription (Whisper v1) -> transcript text
2. Prompt = “Summarize the call according to this template: [user template]” + transcript
3. One-shot model call -> return text to user

(A sketch of this one-shot approach appears after the list of failure modes below.) This approach worked quickly in demos and solved some cases, but it failed in the real world for several reasons:

- Prompt sensitivity: outputs varied based on subtle template wording. When a customer used imprecise language (e.g., “Make it sound like an email but not too formal”), the model interpreted it differently each run.
- Structural drift: headings were renamed, placeholders were dropped, or sections were merged. We saw ~62% structural adherence (heading names + presence of required placeholders) across a 1,000-template test set.
- Malicious / invalid templates: templates with embedded HTML, code, or attempts to override system instructions could produce unexpected output or security concerns.
- Uncontrolled token usage: long templates plus long transcripts led to high token use and unpredictable costs.
- User error: many users submitted templates with ambiguous placeholders or filler words, increasing “garbage in, garbage out” failure modes.
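For reference, the naive one-shot approach looked roughly like this (a sketch; the wrapper and prompt wording are illustrative):

```python
# Minimal sketch of the first, naive approach: template text appended straight
# to the prompt, one model call, output returned unchecked.
from openai import OpenAI

client = OpenAI()

def summarize_naive(transcript: str, user_template: str) -> str:
    prompt = (
        f"Summarize the call according to this template:\n{user_template}\n\n"
        f"Transcript:\n{transcript}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    # No sanitization, no schema, no validation -- hence the ~62% adherence.
    return response.choices[0].message.content
```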
We tried several incremental fixes: stricter front-end validation, examples shown to users, and a longer prompt telling the model to “follow headings exactly.” None of these reliably fixed the core problem. The more we leaned on the single-model approach, the more we saw variable fidelity across template styles and transcripts.

The Solution

We adopted a layered, deterministic pipeline that treats the user template as a first-class artifact: parse → sanitize → canonicalize → plan → generate → validate. The core idea: don’t hand raw user text to the generative model and hope. Instead, turn the template into a machine-checked specification (a schema), use a controlled “meta-prompt” to convert the template into strict generation instructions, and validate output against that schema. We split responsibilities across smaller, specialized components so each step is auditable and testable.

Architecture overview (components and tools):

- Ingress: API (Kubernetes 1.26, FastAPI on Python 3.11)
- Storage: S3 for transcripts, PostgreSQL 15 for metadata
- Workers: Celery 5.2, Redis 7 for task queue and caching
- Models: OpenAI GPT-4 / gpt-4o-mini for generation, GPT-4-Fast for meta-prompting when we needed speed
- Libraries: pydantic v1.10, jsonschema 4.17, spaCy 3.5 for NER, bleach for sanitization
- Monitoring: Prometheus + Grafana, Sentry for errors

Key pipeline stages:

1. Template sanitization. Strip HTML, disallowed control characters, and executable code with bleach and regex filters. Enforce length limits: template body < 4,096 chars (configurable). Extract explicit placeholders (we support simple placeholder syntax: {{name}}, {{action-items}}, etc.).

2. Template parsing & schema generation. We convert the cleaned template into a JSON Schema “blueprint” that captures required sections, headings, and data types (string, list, bullets, optional/required). We validate that the template contains at least one stable anchor (e.g., at least one heading or placeholder); if not, we return a friendly error with suggested fixes. Example conversion rule: a line starting with “###” becomes a required object property; a bullet-list instruction becomes an array type.

3. Meta-prompting (prompt-of-a-prompt). We generate a compact, deterministic instruction for the generator model by combining: the normalized schema (short); example outputs that match the schema (we keep a library of 60 curated examples); and constraints (JSON-only output when requested, strict heading names, maximum token lengths per section). We use a small, faster model (gpt-4o-mini or an optimized instruction-tuned variant) to turn the user’s natural-language template into the canonical meta-instructions when parsing heuristics cannot deterministically infer the full schema.

4. Constrained generation. We ask the model to produce output that either emits JSON conforming to the schema, or emits text with exact headings and clearly delimited sections. We favor JSON output when downstream systems need to programmatically consume summary fields.

5. Validation & repair. We validate the model output against the schema using jsonschema. If it fails, we run a repair pass: identify missing required fields and call the model with a focused prompt: “You missed X. Fill it using transcript references. Answer only the field X.” We allow up to two repair attempts before falling back to a deterministic extractor (rule-based NER + regex) for
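Here is a minimal sketch of that validate-and-repair loop, assuming the template has already been compiled to a JSON Schema. `call_model` is a hypothetical wrapper around the generator, and the repair prompt is illustrative.

```python
# Sketch of the validate-and-repair stage. `call_model` is a hypothetical
# wrapper that sends a prompt to the generation model and returns a string.
import json
from jsonschema import Draft202012Validator

MAX_REPAIRS = 2

def validate_and_repair(raw_output: str, schema: dict, call_model) -> dict:
    summary = json.loads(raw_output)  # may raise; callers treat that as a hard failure
    for _ in range(MAX_REPAIRS):
        errors = list(Draft202012Validator(schema).iter_errors(summary))
        if not errors:
            return summary
        repair_prompt = (
            "The summary below failed validation:\n"
            + "\n".join(e.message for e in errors)
            + "\nReturn ONLY the corrected JSON.\n"
            + json.dumps(summary)
        )
        summary = json.loads(call_model(repair_prompt))
    raise ValueError("Still invalid after repairs; fall back to rule-based extractor")
```

Bounding repairs at two attempts keeps latency and token costs predictable; anything that still fails drops to the deterministic extractor.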

Building the Brain Behind AI Coaching

Ever tried to get an AI to stick to a script? Yeah, me too. 🤦‍♂️ When we set out to build an AI coaching product, I thought the hard part would be making it sound human. Turns out, the real challenge was getting it to follow instructions while also sounding human. Who knew?

The Problem: An AI With Three Personalities

Here’s what we needed to build:

- Knowledge Assessment Mode: the AI needed to be a strict examiner — ask specific questions from uploaded materials, check answers against facts, and never, ever make stuff up.
- Skills Practice Mode: the AI needed to be a supportive trainer — improvise naturally, push users with follow-ups, and know when practice goals were met.
- Guided Prompting Mode: the AI needed to follow a blueprint while adapting to conversation flow — structured enough to hit key points, flexible enough to feel natural.

Oh, and these three modes needed to live in the same system without stepping on each other’s toes. No pressure.

What Everyone Gets Wrong About AI Coaching

When you tell people you’re building an AI coach, they assume it’s easy. “Just throw it at GPT-4 and you’re done, right?” Wrong. Here are the myths we had to bust:

- “One prompt can do everything”: Nope. Trying to cram assessment rules AND roleplay personality AND guided conversation flow into a single prompt is like asking someone to be a drill sergeant, therapist, and improv actor simultaneously.
- “The AI will just know what to do”: The model doesn’t magically understand your assessment structure or conversational blueprints. Without explicit control, it skips questions, hallucinates facts, and generally does whatever it wants.
- “JSON output is reliable”: Ha! The number of times we got malformed JSON or creative interpretations of our schema would make you cry.
- “Unclear user answers will sort themselves out”: When a user gives a vague response, the AI needs a strategy, not permission to improvise endlessly.
- “Flexibility and control are mutually exclusive”: This was the big one. We thought we had to choose between rigid scripts and natural conversation. Turns out, you can have both with the right architecture.

Our First Attempt (AKA: The Disaster)

We did what everyone does first: threw everything at a single LLM instance and hoped for the best. The setup was simple:

- One big prompt with the assessment script, evaluation criteria, and conversation guidelines all mixed together
- Ask the model to self-report what questions it asked
- Use some janky parsing to extract answers from its output
- Cross fingers and ship it

It was a beautiful disaster. The AI invented facts. It skipped questions. It asked random follow-ups that led nowhere. When we asked it to evaluate itself, it was about as reliable as asking a student to grade their own test. And forget about natural conversation flow — it either sounded like a robot reading from a script or went completely off-script.

We ran simulations. Only 62% of assessments actually followed the script. Nearly a third failed because the AI just… forgot to ask certain questions. Another 10% failed because it confidently stated “facts” that didn’t exist in the uploaded documents. The guided conversations weren’t any better: the AI would either stick too rigidly to templates (feeling robotic) or wander off into conversational tangents that never accomplished the training goals. We needed a new approach. Badly.

The Breakthrough: Stop Trusting the AI

The key insight hit us during a particularly frustrating debugging session: we were giving the AI too much power.
Think about it — when you train a human coach, you don’t just hand them a manual and say “figure it out.” You give them a structured program, checkpoints, rubrics, and supervision. You also give them flexibility within boundaries. Why were we trusting an AI to do more than we’d trust a human? So we flipped the script entirely: the code would be the boss. The AI would be the worker.

The New Mental Model

Instead of one monolithic AI brain trying to juggle everything, we built three specialized components working together:

- The Dialogue Graph Engine: this is the script — an actual graph structure that represents every question, every possible answer path, every decision point, and every conversational blueprint. It lives in our code, not in a prompt.
- The LLM Task Runner: the AI gets narrow, specific jobs — “extract an answer in this exact format,” “ask this clarifying question,” or “generate a response that hits these conversational beats.” That’s it. No freelancing.
- The Evaluation Engine: scoring happens in code using explicit rules. No more asking the AI to judge itself.

This separation was everything. Suddenly, we had control and flexibility.

How It Actually Works

Let me walk you through what happens when a user interacts with the system now.

The Dialogue Graph: Your Source of Truth

Every assessment is a graph. Each node represents a specific moment in the conversation with:

- The exact prompt template
- The expected answer format (strict JSON schema for assessments, flexible for practice)
- Validation rules (like “year must be between 1900 and 2025”)
- Node type flags: strict (Knowledge Assessment), flexible (Skills Practice), or blueprint (Guided Prompting)
- What happens next, based on the answer and conversation flow

When a user starts, we’re at node 1. They answer, we validate, we move to the next node. It’s deterministic. Repeatable. Auditable. But it’s also smart enough to adapt when needed. (A sketch of what a node might look like follows below.)

The LLM’s Actual Job: Scoped and Focused

When we hit a node, the LLM gets a tightly focused task that varies by mode:

- For Knowledge Assessment nodes: “Here’s the question. Here are relevant excerpts from the uploaded documents. Extract the answer in this exact JSON format. Nothing else.”
- For Skills Practice nodes: “You’re a supportive trainer. The user is practicing negotiation. Respond naturally, push them with a follow-up that challenges their approach. Report back which training objectives you covered in this hidden structure.”
- For Guided Prompting nodes: “Follow this conversational blueprint. You need to cover these three key points, but adapt your phrasing to the user’s communication style. Emit blueprint tokens showing which beats you’ve hit.”

We set appropriate token
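As promised above, here is a minimal sketch of what a dialogue-graph node might look like (hypothetical field names, not our production engine):

```python
# Sketch of a dialogue-graph node; the field names are illustrative.
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional

class NodeType(Enum):
    STRICT = "strict"        # Knowledge Assessment
    FLEXIBLE = "flexible"    # Skills Practice
    BLUEPRINT = "blueprint"  # Guided Prompting

@dataclass
class DialogueNode:
    node_id: int
    prompt_template: str                  # the exact task handed to the LLM task runner
    answer_schema: Optional[dict] = None  # strict JSON schema for assessment nodes
    validators: list = field(default_factory=list)   # e.g. lambda a: 1900 <= a["year"] <= 2025
    node_type: NodeType = NodeType.STRICT
    transitions: dict = field(default_factory=dict)  # answer outcome -> next node_id

def next_node(node: DialogueNode, outcome: str) -> int:
    """The graph, not the model, decides where the conversation goes next."""
    return node.transitions.get(outcome, node.node_id)
```

The point of the sketch: transitions live in data the code owns, so the model can never skip a question or invent a path.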

Extracting Gold from Conversations: The Hidden Challenges of Transcript Analysis

Did you know that analyzing a transcript conversation isn’t straightforward? Well, neither did I! 🤷🏽‍♂️ When I first started building analysis and evaluation products at Insight7, I quickly realized that working with conversational data presented a plethora of challenges that required more than just technical know-how. So grab your favorite cup of coffee, and let’s dive into the gold mine that is transcript analysis!

Why Transcript Analysis Is Harder Than It Looks

Conversational data is rich with insights but is often messy and unstructured. It may seem like a straightforward process — record a conversation, get a transcript, and voilà! But the reality is far more complicated. Here are some of the hidden challenges:

- Compartmentalization: there’s no one-size-fits-all approach to transcripts. Different types require different handling.
- Lack of numerical data: conversations are text-heavy, and extracting quantifiable data is no small feat.
- Disjointed transcripts: sometimes you’ll encounter transcripts where the information is scattered, making analysis difficult.

Common Misconceptions About Transcript Analysis

Many sales and customer service teams harbor misconceptions about transcript analysis that can lead to missed opportunities. Here are a few:

- AI can do it all: a prevalent belief is that AI can extract insights without preprocessing. In practice, no model performs well on disjointed, unstructured data.
- All transcripts are the same: each conversation is unique. Internal calls, for instance, differ significantly from client calls and require separate handling.
- Readability equals accuracy: just because a transcript looks clean doesn’t mean the insights derived from it are accurate. The system’s interpretation can differ from human understanding.
- Any quote will do: users often assume that any given quote can represent the data accurately, but selection and structure matter greatly.

The Nature of Conversational Data

Conversational data is inherently complex. Unlike structured data, which fits neatly into rows and columns, conversations are fluid and often contain nuances that are easily overlooked. Common problems with raw transcripts include:

- Ambiguity: names can be misidentified or coded as letters (e.g., ‘A’ for ‘InsightLeader’), complicating analysis.
- Disorganized format: from PDFs to voice recordings, formats vary greatly, affecting how you extract valuable insights.

The Core Pipeline: Clean → Process → Identify

To tackle the messiness of conversational data, we follow a core pipeline (sketched in code below):

1. Cleaning. This is the first step, where standard data-cleaning procedures come into play. You need to ensure that the text is free from noise — think filler words, background chatter, or irrelevant comments.

2. Processing. Once cleaned, the next step is to preprocess the data. This involves segmenting the transcript into coherent parts, making it easier to manage. For instance, separating comments by speaker allows for clearer analysis.

3. Identification. This step involves identifying the speakers and the context of the conversation. Are you dealing with a focus group, a tutorial, or a one-on-one interview? The answer shapes how you approach the analysis.
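Here is a toy sketch of the three stages. The regexes and the speaker heuristic are assumptions for illustration, not the production rules.

```python
# Illustrative Clean -> Process -> Identify pipeline over "Name: text" transcripts.
import re

FILLERS = re.compile(r"\b(um+|uh+|you know)\b", re.IGNORECASE)

def clean(line: str) -> str:
    """Step 1: strip filler words and collapse whitespace."""
    return re.sub(r"\s+", " ", FILLERS.sub("", line)).strip()

def process(transcript: str) -> list:
    """Step 2: segment into (speaker, utterance) turns, assuming 'Name: text' lines."""
    turns = []
    for line in transcript.splitlines():
        match = re.match(r"^\s*([A-Za-z][\w .'-]*):\s*(.+)$", line)
        if match:
            turns.append((match.group(1), clean(match.group(2))))
    return turns

def identify(turns: list) -> str:
    """Step 3: a toy conversation-type heuristic -- two speakers or fewer = interview."""
    speakers = {speaker for speaker, _ in turns}
    return "one-on-one interview" if len(speakers) <= 2 else "focus group"
```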
Solving Transcript Problems With Practical Techniques

Now that we’ve laid the groundwork, let’s explore some practical techniques for overcoming common transcript challenges:

- Detecting conversation types: identifying call types helps in processing different transcripts effectively. For example, insights gleaned from a focus group differ significantly from those derived from a tutorial.
- Using AI + analysis models for metadata extraction: leveraging AI models lets us glean essential metadata from conversations — identifying customers, their company size, or even specific sentiments expressed during the call.
- Structuring transcripts with index parsing: I developed an index-parsing approach that restructures the text into an indexed format, making it easier to analyze and retrieve information.
- Hybrid named entity recognition (NER): a mix of LLMs (large language models) and rule-based methods can tackle the challenge of identifying speakers — even when names are outliers or coded.
- Handling disjointed transcripts: disjointed conversations are tricky. The best technique I’ve found is using an LLM to process the entire conversation. While costly, it tends to yield the most accurate results.

Real-World Impact of Transcript Analysis

In dozens of real-world cases at Insight7, transcript analysis didn’t just save time — it revealed patterns and opportunities that teams acted on immediately. For example, sales teams discovered that customers were dropping off not because of price but because of integration and implementation concerns, prompting demo and onboarding changes that boosted close rates. Customer-service operations exposed frustration not with response speed but with repeated handoffs and conflicting answers — leading to the adoption of an owner-agent model and higher CSAT scores. On the coaching front, managers used transcript-driven metrics (like talk ratio, missed value recaps, and failure to “ask the next step”) to give precise feedback, resulting in improved call quality and more predictable follow-ups. Product teams even used recurring customer complaints to drive roadmap changes, showcasing how Insight7 makes analyzing interviews faster and more impactful.

Can You Extract Goals From Transcripts?

Absolutely. With a refined system that reliably identifies conversation types, we can effectively analyze and evaluate transcripts. This capability empowers CEOs and project managers to make insightful decisions based on their data.

How Insight7 Makes This Entire Process Automatic

At Insight7, we’ve developed tools that automate the transcription and analysis of conversations in over 60 languages. Here’s how we deliver value:

- Clear, actionable insights: we surface recurring themes, sentiment, pain points, and meaningful quotes.
- Visualization: our dashboards, journey maps, and scorecards help visualize findings for easy interpretation.
- Collaboration and reporting: designed for product, sales, CX, and research teams, our platform supports collaboration and evidence-based decision-making — all with enterprise-grade security.

Conclusion

In sales and customer service, understanding conversations isn’t just about transcripts; it’s about transforming unstructured data into actionable insights. By embracing the challenges of transcript analysis, we can extract the gold nuggets that lie within conversations and drive informed decision-making. Ready to unlock the potential of your conversational data?
Join the ranks of successful sales and customer service teams by leveraging Insight7’s powerful tools. Let’s turn your conversations into actionable insights today!

A Week, an Idea, and an AI Evaluation System: What I Learned Along the Way

How the Project Started

I remember the moment the evaluation request landed in my Slack. The excitement was palpable — a chance to delve into a challenge that was rarely explored. The goal? To create a system that could evaluate the performance of agents during conversations. It felt like embarking on a treasure hunt, armed with nothing but a week’s worth of time and a wild idea. Little did I know, this project would not only test my technical skills but also push the boundaries of what I thought was possible in AI evaluation.

A Rarely Explored Problem Space

Conversations are nuanced; they’re filled with emotions, tones, and subtle cues that a machine often struggles to decipher. This project was an opportunity to explore a domain that needed attention — a chance to bridge the gap between human conversation and machine understanding.

What Needed to Be Built

With the clock ticking, the mission was clear:

- Create a conversation evaluation framework capable of scoring agents against predefined criteria.
- Provide evidence of performance to build trust in the evaluation.
- Ensure that the system could adapt to various conversational styles and tones.

What made this mission so thrilling was the challenge of designing a system that could accurately evaluate the intricacies of human dialogue — all within just one week.

What Made the Work Hard (and Exciting)

This project was both daunting and exhilarating. I was tasked with:

- Understanding the nuances of human conversation: how do you capture the essence of a chat filled with sarcasm or hesitation?
- Developing a scoring rubric: a clear, structured approach was essential to avoid ambiguity in evaluations.
- Iterating quickly: with a week-long deadline, every hour counted, and fast feedback loops became my best friends.

Despite the challenges, the thrill of creating something groundbreaking kept me motivated. Building something new always excites me — it’s unpredictable, and there was always a chance the entire system could fail.

Lessons Learned While Building the Evaluation Framework

Through the highs and lows of this intense week, I gleaned valuable insights worth sharing:

- Quality isn’t an afterthought — it’s a system. Reliable evaluation requires clear rubrics, structured scoring, and consistent measurement rules that remove ambiguity.
- Human nuance is harder than model logic. Real conversations involve tone shifts, emotions, sarcasm, hesitation, filler words, incomplete sentences, and even transcription errors. Teaching AI to interpret this required deeper work than expected.
- Criteria must be precise or the AI will drift. Vague rubrics lead to inconsistent scoring. Human expectations must be translated into measurable, testable standards.
- Evidence-based scoring builds trust. It wasn’t enough for the system to assign a score — we had to show why. High-quality evidence extraction became a core pillar.
- Evaluation is iterative. Early versions seemed “okay” until real conversations exposed blind spots. Each iteration sharpened accuracy and generalization.
- Edge cases are the real teachers. Background noise, overlapping speakers, low-empathy moments, escalations, and long pauses forced the system to become more robust.
- Time pressure forces clarity. With only a week, prioritization and fast feedback loops became essential. The constraint was ultimately a strength.
- A good evaluation system becomes a product. What began as a one-week sprint became one of our most popular services, because quality, clarity, and trust are universal needs.
How the System Works (High-Level Overview)

The evaluation system takes a multi-faceted, evidence-based approach:

- Data collection: conversations are transcribed and analyzed in over 60 languages.
- Evaluation against rubrics: the AI evaluates transcripts against structured sub-criteria using our Evaluation Data Model.
- Scoring mechanism: each criterion is scored out of 100, with weighted sub-criteria and supporting evidence. (A minimal sketch of this roll-up appears at the end of this piece.)
- Performance summary & breakdown: an overall summary, a detailed score breakdown, relevant quotes from the conversation, and the evidence supporting each evaluation.

This approach streamlines evaluation and empowers teams to make faster, more informed decisions.

Real Impact — How Teams Use It

Since launch, teams across product, sales, customer experience, and research have used the evaluation system to enhance their operations. They are now able to:

- Identify strengths and weaknesses in agent interactions.
- Provide targeted training to improve agent performance.
- Foster a culture of continuous, evidence-driven improvement.

The real impact lies in transforming conversations into actionable insights — leading to better customer experiences and stronger business outcomes.

Conclusion — From One-Week Sprint to Flagship Product

What started as a one-week sprint has evolved into a flagship product that continues to grow and adapt. This journey taught me that the intersection of human conversation and AI evaluation is not just a technical pursuit — it’s about understanding the essence of communication itself. “I build intelligent systems that help humans make sense of data, discover insights, and act smarter.” This project became a living embodiment of that philosophy. By refining the evaluation framework, addressing the nuances of human conversation, and focusing on evidence-based scoring, we created a robust system that not only meets our needs but also sets a new standard for AI evaluation.
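As referenced in the scoring mechanism above, here is a minimal sketch of the weighted roll-up. The structure and names are illustrative, not the production Evaluation Data Model.

```python
# Sketch of weighted, evidence-backed criterion scoring; names are illustrative.
from dataclasses import dataclass

@dataclass
class SubCriterion:
    name: str
    weight: float   # weights within a criterion sum to 1.0
    score: float    # 0-100, assigned by the evaluator
    evidence: list  # supporting quotes from the transcript

def criterion_score(sub_criteria: list) -> float:
    """Weighted roll-up: each criterion is scored out of 100."""
    assert abs(sum(s.weight for s in sub_criteria) - 1.0) < 1e-6
    return sum(s.weight * s.score for s in sub_criteria)

empathy = [
    SubCriterion("acknowledges frustration", 0.6, 80, ["'I completely understand...'"]),
    SubCriterion("avoids blame language", 0.4, 95, ["'Let's fix this together.'"]),
]
print(criterion_score(empathy))  # 86.0
```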

Why AI Coaching Scales What Human Coaching Can’t

I think everyone’s having the wrong argument about AI coaching, and perhaps about AI in general. The debate centers on replacement: “Can AI replace human coaches?” Speed versus empathy, scale versus intuition. The real question is whether human coaching can see what AI systems see. I believe that answer matters more than the replacement debate.

The Coach Who Only Watches One Game

Imagine a basketball coach who only watches the last game their team played. Just the most recent one. That coach gives feedback based solely on that single game. “You’re shooting too many threes,” they say after watching a game where the team went cold from distance. The previous ten games showed the opposite problem, but those games are invisible.

This is human sales coaching in 2025. Sales managers coach based on recency bias. They listen to the last call they reviewed, the most recent deal that went sideways, the conversation that’s fresh in their memory. The pattern emerges from what just happened, whether or not that represents reality.

Now imagine a coach that watches every single game. It sees that your top performer closes 30% more deals when they ask about budget in the first five minutes. It notices your struggling rep loses deals when discovery runs past 40 minutes. It catches that objections about price actually predict wins, because engaged buyers negotiate. Human coaches see snapshots, whereas an AI coach sees the time-lapse.

The Clone Problem

Here’s an uncomfortable truth about human coaching: managers create clones of themselves. When your manager was a rep, they had a style that worked. Maybe consultative. Maybe aggressive. Maybe relationship-driven. That style got them promoted. Now they’re coaching you to sell the way they sold. Their personality, their approach, their techniques are all applied to your territory, your buyers, your situation. This is like a chef who only knows French cuisine teaching you to cook. You’re learning French cooking. What would you do if your customers wanted Thai?

An AI coach analyzes what actually works for reps with your profile, selling to your type of buyer, in your specific situation. It coaches you to become the best version of yourself based on data from thousands of similar situations. You learn to sell like the top performer in your category, with your strengths, to your buyers.

The Favoritism Tax

You might not want to admit it, but managers play favorites. The intention is good: top performers are easier to coach. They ask better questions. They implement feedback faster. They make the manager look good. So they get more coaching time. It’s a rational choice: managers have limited bandwidth, and top performers yield a higher return on that investment. Consequently, struggling reps receive less attention.

This leads to a predictable outcome: the reps most in need of coaching are the ones who receive it least. The performance gap widens — the bottom performers remain stalled while the top performers pull further ahead. This is the coaching world’s version of “the rich get richer.” AI gives every rep the same depth of analysis, the same quality of feedback, the same attention to their specific gaps. The struggling rep gets the same treatment as the star performer. Human coaching compounds advantages. AI coaching levels the field.

Coaching Latency Kills Deals

Here’s a question: when does coaching actually happen?
In most organizations, it goes like this: a rep has a call on Monday. The manager reviews it on Friday (if they review it at all). They schedule a coaching session for the following Wednesday. By the time the rep gets feedback, they’ve already had ten more calls with the same mistake.

This is coaching latency: the gap between the mistake and the correction. Consider a tennis coach who observes your poor serving technique but says nothing. Two weeks later, after you’ve repeated the incorrect motion 500 times, they finally point out, “Your serve is off.” By then, the bad habit is deeply ingrained. This scenario illustrates a fundamental problem in coaching today: the latency in sales coaching is enormous.

Now consider the alternative. A rep makes an error handling an objection early on Monday. By that afternoon, AI has already flagged it. The rep receives feedback before their next call, preventing the mistake from becoming a pattern. The error is compounded zero times instead of ten. Human coaching corrects mistakes after they’ve become habits. AI coaching catches mistakes before they become patterns.

The Practice Paradox

Elite athletes train constantly. LeBron James has taken millions of practice shots. Serena Williams has hit millions of practice serves. They learned through repetition in safe environments before playing real games. Sales reps learn by playing only real games. Think about that. Every call is live. Every conversation is with an actual buyer who could actually close or walk away. There’s no practice court. No scrimmage. No safe place to fail.

Roleplaying scenarios, where a manager acts as the buyer, are a common but often ineffective coaching tool. Because the rep knows they are playing against their manager, the dynamic is awkward for everyone. The feedback is merely one person’s subjective opinion on an artificial interaction, after which the team simply returns to their actual calls. Sales is the only profession where beginners practice on customers.

AI creates judgment-free simulation at scale. You can handle the toughest objection 50 times before you face it live. You can practice discovery questions until the structure becomes muscle memory. You can fail privately and learn publicly. Top performers are made through repetition. AI makes that repetition possible without burning through your buyer list.

What Humans Can’t See

Human coaches listen for specific things. Did the rep ask discovery questions? Did they handle the objection? Did they set next steps? These are coachable moments, and good managers catch them. But there’s a category of insights human coaches literally cannot perceive: talk speed. Reps who speak 10% slower close 15% more

Your Customer Conversations Are Crude Oil (And Most Companies Are Just Storing Barrels)

I. The Refinery Revolution

In 1859, Edwin Drake struck oil in Titusville, Pennsylvania. Within months, entrepreneurs flooded the region, drilling wells and filling barrels with black liquid gold. They sold it as crude petroleum for lamp oil, a single-purpose commodity. Then, in the 1860s, everything changed. Refineries discovered that crude oil wasn’t one product — it was dozens. Through fractional distillation, that same barrel yielded gasoline for engines, kerosene for lamps, lubricants for machines, asphalt for roads, and petrochemicals for plastics. The oil didn’t change, but the extraction process did. The wildcatters who kept selling crude were left behind. The fortune went to those who built refineries.

Today, most organizations are wildcatters. They’re drilling thousands of customer conversations (sales calls, support tickets, customer interviews, onboarding sessions) and storing them in barrels. Some play them back occasionally. Most sit untouched in digital warehouses, a vast reserve of raw material that never gets refined. Your customer conversations are crude oil. And you’re leaving billions in value unextracted.

II. The Illusion of Capture

Here’s what most companies think they’ve solved: “We record our calls now. We have conversation intelligence. We’re good.” In reality, you’ve built the oil well. You’ve got the storage tanks. You’re capturing the crude. You haven’t built the refinery. Recording a call is like pumping oil into a barrel. Useful? Sure. Valuable? Not yet. That barrel of crude sitting in your storage facility doesn’t power cars, pave roads, or create plastics. It’s just potential energy with no kinetic value.

Traditional conversation intelligence platforms extract one product: sales insights. They transcribe calls, tag keywords, identify objections, track competitor mentions, and feed your sales team coaching moments. It’s basically kerosene — valuable for lighting lamps, but nowhere near the full potential of the barrel. That same conversation contains gasoline for marketing, lubricants for customer experience, asphalt for product development, and petrochemicals for Learning & Development. Every customer conversation is a barrel of crude with a dozen refined products waiting to be extracted, but you’re only processing one.

Most organizations don’t have a refining problem. They have a distribution problem. They’re pumping oil and sending 100% of it to the sales department, while marketing, customer experience, and L&D teams stand empty-handed, wondering why they can’t access the insights they desperately need.

III. What’s Actually in the Barrel?

Let’s crack open a single sales conversation: a 45-minute discovery call with a prospect evaluating your product. To most companies, this is “a sales call.” To a refinery, it’s multiple products waiting to be extracted.

For Sales: objection patterns, competitive positioning, win/loss signals, deal-risk indicators, champion identification. This is what your current system already extracts. It’s important. It’s also just 15% of what’s in that barrel.

For Customer Service: early warning signs of misaligned expectations. Onboarding gaps that create friction before they become support tickets. Expansion signals buried in casual mentions (“We’d love this for the marketing team too…”). Health scores your CSMs can’t see yet but your calls are screaming about. The difference between a customer who renews and one who churns often shows up in sales conversations months before the renewal date.
For Marketing: the actual language customers use to describe their pain. Not what your positioning deck claims they say — what they actually say. Brand perception in the wild. Messaging that resonates vs. falls flat. Competitive differentiation that matters to buyers, not what you think matters. Content gaps prospects mention. Your next campaign headline is buried in transcript line 247.

And there’s more: L&D teams can extract what top performers say differently, coaching moments that actually work in practice, the tribal knowledge that never makes it into training decks. All of this exists in one conversation. Right now, you extract one product and call it a day. That’s leaving the refinery half-built.

IV. Building the Refinery: A New Model

So what does a refinery for customer conversations actually look like? The insight that changed how we think about this problem at Insight7 is simple: conversations don’t need to be processed once. They need to be processed in parallel, through multiple specialized lenses simultaneously.

Think about how a petrochemical refinery works. Crude oil doesn’t go through a single processing unit. It moves through specialized towers, each designed to extract specific compounds at different temperatures. The kerosene tower operates at 150–250°C. The diesel tower at 250–350°C. The lubricant tower above that. Same crude, different processes, different outputs.

Customer conversations need the same architecture. Instead of running conversations through one analysis pipeline that tries to extract “insights” for everyone, you need parallel processing systems, each optimized for a specific team’s needs. The Sales Refinery asks: “What does this conversation tell us about closing this deal?” The Customer Service Refinery asks: “What does this tell us about this customer’s health and trajectory?” The Marketing Refinery asks: “What does this tell us about how the market sees us?” These are fundamentally different questions. You can’t answer all of them with one analysis. (A sketch of this parallel-processing idea follows below.)

Here’s where most organizations get the model wrong. They think the goal is to “democratize access” to conversations. So they build a platform everyone can log into, search transcripts, listen to calls. That’s not a refinery. That’s a warehouse full of barrels with a search function. The refinery model recognizes that each team needs its refined product delivered directly, not access to the crude. Your marketing team doesn’t need to listen to 47 sales calls to understand messaging performance. They need a synthesis of what’s working across those 47 calls. Your product team doesn’t need transcript access. They need clustered feature requests with frequency data.

Think of it as a distribution network. Gasoline goes to gas stations. Jet fuel goes to airports. Lubricants go to manufacturing plants. Each gets its refined product, in its format, where they work. This is what we’ve built at Insight7. Not just parallel processing, but parallel learning. Each refinery gets smarter about what matters for its specific function. The sales refinery
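To ground the parallel-refineries idea from section IV, here is a minimal sketch: one transcript, several specialized analysis passes run side by side. `analyze` is a hypothetical stand-in for a real model call, and the refinery questions are taken from the text above.

```python
# Sketch of "parallel refineries": one transcript, several specialized passes.
from concurrent.futures import ThreadPoolExecutor

REFINERIES = {
    "sales": "What does this conversation tell us about closing this deal?",
    "customer_service": "What does this tell us about this customer's health and trajectory?",
    "marketing": "What does this tell us about how the market sees us?",
}

def analyze(question: str, transcript: str) -> str:
    # Stub: swap in a real LLM call here.
    return f"[model output for: {question}]"

def refine(transcript: str) -> dict:
    """Run every refinery over the same 'barrel' in parallel."""
    with ThreadPoolExecutor() as pool:
        futures = {team: pool.submit(analyze, question, transcript)
                   for team, question in REFINERIES.items()}
        return {team: future.result() for team, future in futures.items()}
```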

Webinar on Sep 26: How VOC Reveals Opportunities NPS Misses
Learn how Voice of the Customer (VOC) analysis goes beyond NPS to reveal hidden opportunities, unmet needs, and risks—helping you drive smarter decisions and stronger customer loyalty.