QA managers and contact center operators building a QA scorecard from scratch face a decision that shapes the entire program: whether to use generic industry templates or design criteria from the ground up based on actual call behavior.

Templates produce scorecards that measure what the industry thinks matters. Custom scorecards measure what your customers and business model actually require. This guide covers how to create an effective QA scorecard from scratch in 2026.

Why Generic Scorecards Fail

The most common failure mode in new QA programs is borrowing a scorecard from a template without calibrating it to actual call behavior. A hospitality company and an insurance company should not use the same QA criteria, even if both run inbound support lines.

Generic criteria produce scores that do not correlate with customer outcomes. Teams can hit high marks against a template and still see declining CSAT, because the template measures the wrong things. Building from scratch forces alignment between what you score and what actually matters.

What criteria should be included in a QA scorecard?

The most effective scorecards include four categories: compliance criteria (things reps must say or do for legal or policy reasons), process criteria (things reps should follow for consistency), quality criteria (things that differentiate good from acceptable interactions), and outcome criteria (whether the customer's issue was actually resolved). Most scorecards over-index on compliance and process and under-index on quality and outcome.

Step 1: Define What "Good" Looks Like From Actual Calls

Before writing a single criterion, pull 20-30 recent calls that your most experienced managers would call "excellent" and 20-30 they would call "poor." Listen to both sets. The differences you hear are your criteria.

This step cannot be skipped. Criteria written from memory or from templates reflect assumptions about what good looks like, not what it actually sounds like. The language, the behaviors, the specific moments where calls go well or poorly — these patterns only emerge from real calls.

Document what you notice in plain language: "Rep asked an open-ended question to understand the full problem before proposing a solution" or "Rep gave the customer a specific next step and confirmed understanding at the end."

Step 2: Structure Criteria with Context

Each criterion needs three elements to produce reliable scoring: the criterion name, the scoring description (what earns a passing score), and context definitions (what "great" looks like versus what "poor" looks like).

The context column is the element most scorecards omit. Without it, two QA reviewers will score the same call differently. The context makes the standard explicit rather than assumed.

Example:

Criterion: Opening rapport
Scoring description: Rep greets customer by name and establishes a warm, professional tone in the first 30 seconds.
What "great" looks like: Rep uses customer's first name naturally, matches their communication pace, and makes the customer feel heard before addressing the issue.
What "poor" looks like: Rep reads from a script with flat intonation, does not acknowledge the customer's name, or moves directly to problem resolution without greeting.

Insight7 uses this exact structure: main criteria, sub-criteria, descriptions, and a context column defining what great and poor look like. Weights are configurable and must sum to 100%. This structure is what enables automated scoring to align with human judgment.
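For illustration, here is a minimal Python sketch of that structure, assuming a simple dataclass representation. The class and field names are hypothetical, not Insight7's schema; the point is that each criterion carries its own context definitions and category, and that weights are validated to sum to 100%.

from dataclasses import dataclass

@dataclass
class Criterion:
    name: str                  # e.g. "Opening rapport"
    category: str              # compliance | process | quality | outcome
    scoring_description: str   # what earns a passing score
    great_looks_like: str      # context: what "great" sounds like
    poor_looks_like: str       # context: what "poor" sounds like
    weight: float              # percentage of total score

def validate_weights(criteria: list[Criterion]) -> None:
    """Weights are percentages and must sum to 100."""
    total = sum(c.weight for c in criteria)
    if abs(total - 100.0) > 0.01:
        raise ValueError(f"Criterion weights sum to {total}, expected 100")

opening_rapport = Criterion(
    name="Opening rapport",
    category="quality",
    scoring_description="Rep greets customer by name and establishes a warm, "
                        "professional tone in the first 30 seconds.",
    great_looks_like="Uses the customer's first name naturally, matches their pace, "
                     "and makes them feel heard before addressing the issue.",
    poor_looks_like="Reads from a script with flat intonation, skips the name, "
                    "or jumps straight to problem resolution.",
    weight=10.0,
)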

Step 3: Assign Weights and Test for Calibration

Not all criteria are equally important. A missed required disclosure is more serious than a slightly rushed closing. Assign weights that reflect the business priority of each criterion.

As a starting point: compliance criteria typically carry 40-50% of total weight in regulated industries, process criteria 20-30%, quality criteria 20-30%. Outcome criteria, when scored, often carry the highest individual weight per item.
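A worked allocation within those ranges might look like the following sketch; the split is illustrative, not a recommendation for any particular operation.

# Illustrative weight split for a regulated-industry scorecard (percent of total score)
weights = {
    "compliance": 45,   # required disclosures, prohibited language
    "process":    25,   # call flow, documentation, escalation steps
    "quality":    20,   # rapport, discovery questions, clarity
    "outcome":    10,   # issue actually resolved
}
assert sum(weights.values()) == 100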

After assigning weights, run a calibration exercise. Have two or three experienced reviewers score the same 10 calls independently. Compare scores. Where they disagree by more than one rating level, review the criterion definition — the ambiguity is in the criteria language, not in the reviewers.
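A minimal sketch of that comparison, assuming each reviewer rates every criterion on the same numeric scale (for example 1 to 5) and scores the same set of calls:

# scores[reviewer][call_id][criterion] -> rating level (e.g. 1-5)
def flag_disagreements(scores: dict, threshold: int = 1) -> list[tuple]:
    """Return (call_id, criterion, ratings) wherever reviewers differ by more than one level."""
    flagged = []
    reviewers = list(scores)
    calls = scores[reviewers[0]]
    for call_id, criteria in calls.items():
        for criterion in criteria:
            ratings = [scores[r][call_id][criterion] for r in reviewers]
            if max(ratings) - min(ratings) > threshold:
                flagged.append((call_id, criterion, ratings))
    return flagged

# Criteria that appear repeatedly in the flagged list are the ones whose
# definitions need tighter context, not the reviewers who scored them.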

Initial calibration to match human judgment typically takes 4-6 weeks for complex operations. Teams that try to shortcut calibration end up with scores that diverge from human judgment, which undermines trust in the scoring system.

Step 4: Choose Script-Based or Intent-Based Evaluation

For each criterion, decide whether you are checking verbatim compliance or intent. This distinction matters especially when using AI-assisted scoring.

Script-based: The rep must say a specific phrase or follow a specific sequence. "Our calls may be recorded for quality and training purposes" is a compliance disclosure that must be said, not approximated.

Intent-based: The rep must demonstrate a behavior, and multiple phrasings qualify. "Building rapport" can be accomplished many ways. Intent-based evaluation scores the underlying behavior, not a specific script.

Most effective QA scorecards use a mix: script-based for compliance items, intent-based for quality and relationship items. Insight7 supports both modes per criterion, which is what allows a single scorecard to handle both regulatory and quality evaluation without forcing everything through keyword matching.
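The two modes amount to two different kinds of checks. The sketch below is illustrative only: the script-based check is a normalized verbatim match, while the intent-based check is left as a placeholder, since in practice it would be backed by a classifier or language model rather than keyword matching.

import re

REQUIRED_DISCLOSURE = "our calls may be recorded for quality and training purposes"

def check_script_based(transcript: str, required_phrase: str = REQUIRED_DISCLOSURE) -> bool:
    """Script-based: the exact phrase must appear (case- and whitespace-insensitive)."""
    normalized = re.sub(r"\s+", " ", transcript.lower())
    return required_phrase in normalized

def check_intent_based(transcript: str, behavior: str) -> bool:
    """Intent-based: judge the underlying behavior, e.g. 'built rapport in the opening'.
    Placeholder only; a real implementation would call a model, not match keywords."""
    raise NotImplementedError("Replace with a model that evaluates the behavior")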

Step 5: Build the Alert Layer

A scorecard without an alert system requires manual review to find problems. Define which criteria violations trigger real-time alerts and what the severity levels are.

Critical alerts (hang-ups, required disclosure omissions, prohibited language) should notify supervisors immediately. Warning alerts (score below threshold, incomplete process steps) can aggregate into daily reports. Informational alerts (quality patterns worth noting) can feed weekly team reviews.
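One way to express those tiers is a small routing configuration; the violation names and destinations below are illustrative, not a fixed taxonomy.

# Illustrative alert routing: violation type -> severity and destination
ALERT_RULES = {
    "hang_up":                  {"severity": "critical", "route": "supervisor_now"},
    "missing_disclosure":       {"severity": "critical", "route": "supervisor_now"},
    "prohibited_language":      {"severity": "critical", "route": "supervisor_now"},
    "score_below_threshold":    {"severity": "warning",  "route": "daily_report"},
    "incomplete_process_steps": {"severity": "warning",  "route": "daily_report"},
    "quality_pattern":          {"severity": "info",     "route": "weekly_review"},
}

def route_alert(violation: str) -> str:
    """Default unknown violations to the lowest-urgency channel."""
    rule = ALERT_RULES.get(violation, {"route": "weekly_review"})
    return rule["route"]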

The alert layer converts the scorecard from a retrospective measurement tool into an active monitoring system.

If/Then Decision Framework

  • If you are starting with no prior QA program: begin with 8-12 criteria, calibrate thoroughly, then add criteria once the base is reliable.
  • If your scores don't correlate with CSAT: review whether your quality and outcome criteria are weighted appropriately relative to compliance and process.
  • If human reviewers disagree frequently: the criteria language is ambiguous. Add context definitions until inter-rater reliability improves.
  • If you have regulatory disclosure requirements: make those criteria script-based and weight them at 20%+ of total score.
  • If you want to scale beyond manual review: use Insight7 to automate scoring against your configured criteria at 100% call coverage.

How do you get buy-in from agents and managers for a new QA scorecard?

Involve both groups in the criteria-building process. Agents who help define what "excellent" looks like on a call are more likely to trust the resulting scorecard. Managers who see their own standards reflected in the criteria will defend the system when agents challenge individual scores. Show calibration data — the evidence that the scorecard scores consistently — before rolling out widely.

FAQ

How many criteria should a QA scorecard have?

Most effective scorecards have 8-15 criteria. Fewer than 8 criteria often fail to distinguish performance dimensions meaningfully. More than 15 increases reviewer burden, reduces consistency, and makes coaching conversations harder to focus. If you find yourself with 20+ criteria, consolidate overlapping items or move some to a separate supplemental scorecard for specific call types.

How do you handle calls that don't fit the standard scorecard?

Complex escalations, language barrier calls, and technical edge cases often don't score well on a standard scorecard even when the rep performed well. Solutions include: creating separate scorecards for these call types with different criteria weightings, adding a "context override" field where reviewers can flag and explain atypical scoring, and excluding certain call types from aggregate trend calculations.


A QA scorecard built from actual call behavior, with structured context definitions and calibrated weights, produces scores that teams trust. Insight7 automates scoring against configured criteria at 100% coverage, turning a well-built scorecard into a scalable monitoring and coaching system.