Glossary | Switchboard Audio SDK

Audio Glossary

A practical guide to the terminology behind audio and voice AI.

Acoustic Echo Cancellation (AEC)

When a device both plays sound and listens at the same time, it risks hearing itself. Acoustic Echo Cancellation (AEC) solves this by removing speaker output from microphone input in real time. Without it, voice systems can spiral into feedback loops, repeating their own speech.

In practice, AEC is essential for devices like phones, earbuds, and smart speakers. Modern systems use adaptive filtering to continuously model how sound travels from speaker to mic. However, distortion—especially from small, overdriven speakers—can confuse these models, leaving traces of echo behind.

Audio Graph

Think of an audio graph as a flowchart for sound. It’s a network of small, specialized components (nodes), each handling one task—like capturing audio, reducing noise, transcribing speech, or generating responses.

This modular design makes voice systems flexible. Want to swap a speech model or add a new feature? Just replace or insert a node instead of rebuilding everything. Audio graphs are what make experimentation and rapid iteration possible in modern voice AI systems.
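The node-and-chain idea can be shown in a few lines of Python. This is a minimal sketch, not the SDK's actual API: `AudioGraph`, `gain`, and `clip` are hypothetical names, and real graphs support branching, async execution, and typed ports rather than a simple linear chain.

```python
from typing import Callable, List

class AudioGraph:
    """A minimal audio graph: an ordered chain of processing nodes.

    Each node is a function that takes a list of samples and returns
    a new list. Swapping a stage means swapping one callable.
    """

    def __init__(self) -> None:
        self.nodes: List[Callable[[list], list]] = []

    def add_node(self, node: Callable[[list], list]) -> "AudioGraph":
        self.nodes.append(node)
        return self

    def process(self, samples: list) -> list:
        for node in self.nodes:
            samples = node(samples)
        return samples

# Two toy nodes: a gain stage and a hard clipper.
def gain(samples):
    return [s * 2.0 for s in samples]

def clip(samples):
    return [max(-1.0, min(1.0, s)) for s in samples]

graph = AudioGraph().add_node(gain).add_node(clip)
print(graph.process([0.2, 0.6, -0.8]))  # [0.4, 1.0, -1.0]
```

Replacing the `gain` node with, say, a noise-reduction node changes behavior without touching the rest of the chain, which is the property the text describes.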

Automatic Speech Recognition (ASR)

ASR is what turns spoken words into text. It’s the entry point for most voice systems—everything downstream depends on how well it performs.

There are two primary architectural approaches: batch (waits until speech ends) and streaming (transcribes as you speak). Streaming feels faster but can introduce mistakes mid-sentence. Accuracy is measured using Word Error Rate (WER), and even small errors can cascade into poor responses. In short: better input audio and better models lead to better conversations.

Barge-In

Barge-in is what happens when a user interrupts a system mid-response—and how the system handles it matters more than you’d expect.

A good system stops speaking almost instantly and shifts attention to the user. A bad one keeps talking, creating frustration.

Technically, this requires constant listening, fast speech detection, and immediate audio shutdown. The key metric here is barge-in latency—the delay between user speech and system silence. Even small delays can make an otherwise fast system feel broken.

Bit Rate

Bit rate measures how much audio data is transmitted per second, usually in kilobits per second (kbps). It directly affects both sound quality and bandwidth usage.

Higher bit rates preserve more detail, improving transcription accuracy and speech naturalness. Lower bit rates reduce data usage and latency but sacrifice clarity.

Choosing the right bit rate is a balancing act—too high wastes resources, too low degrades performance. In voice AI, this decision often depends on network conditions and application requirements.
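For uncompressed PCM audio, bit rate follows directly from sample rate, bit depth, and channel count. A small sketch (the function name is ours, not from any library):

```python
def pcm_bit_rate_kbps(sample_rate_hz: int, bit_depth: int, channels: int = 1) -> float:
    """Uncompressed PCM bit rate in kilobits per second."""
    return sample_rate_hz * bit_depth * channels / 1000

# 16 kHz, 16-bit mono -- a common voice AI capture format.
print(pcm_bit_rate_kbps(16_000, 16))       # 256.0 kbps
# CD audio: 44.1 kHz, 16-bit stereo.
print(pcm_bit_rate_kbps(44_100, 16, 2))    # 1411.2 kbps
```

Compressed codecs (Opus, AAC) break this direct relationship, delivering usable voice at tens of kbps, which is why codec choice and bit rate are usually decided together.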

Buffer

A buffer is temporary storage for audio data as it moves through a system. It helps smooth out timing mismatches between recording, processing, and playback.

Buffer size directly affects performance:

  • Larger buffers = more stability but more delay

  • Smaller buffers = faster response but risk of glitches

In real-time voice systems, tuning buffers is critical. Too much delay breaks conversation flow; too little causes audio dropouts. It’s one of the most important—and often overlooked—latency controls.
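The delay a buffer adds is easy to compute: buffer length in samples divided by sample rate. A quick illustration (function name is ours):

```python
def buffer_latency_ms(buffer_frames: int, sample_rate_hz: int) -> float:
    """Delay contributed by one buffer of audio, in milliseconds."""
    return buffer_frames / sample_rate_hz * 1000

# Common buffer sizes at 48 kHz:
for frames in (128, 512, 2048):
    print(frames, "samples ->", round(buffer_latency_ms(frames, 48_000), 2), "ms")
```

At 48 kHz, a 128-sample buffer adds under 3 ms while a 2,048-sample buffer adds over 40 ms, which is exactly the stability-versus-delay tradeoff described above.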

Call Center Automation

Call center automation replaces or augments human agents with voice AI systems that can handle real conversations. Unlike traditional phone menus, these systems understand natural language and can complete tasks end-to-end.

They must integrate with telephony systems, backend APIs, and escalation workflows. The real advantage comes from flexibility—teams can update behavior or swap models without rebuilding the system. As AI improves, these systems are rapidly becoming the default for high-volume customer interactions.

Component Pipeline

A component pipeline splits voice AI into three stages: speech-to-text (ASR), reasoning (LLM), and text-to-speech (TTS). Each runs independently.

This modularity makes systems easy to update and customize. But there’s a cost: latency adds up across each step, and important vocal cues—tone, hesitation—are lost once audio becomes text.

It’s a practical, widely used architecture, but increasingly being challenged by newer approaches that keep audio intact throughout processing.

Context Window

The context window defines how much information a model can consider at once—everything from conversation history to system instructions.

As conversations grow, this space fills up, increasing processing time and forcing tradeoffs. Systems often deal with this by summarizing older content or trimming less relevant parts.

A larger context can improve understanding but also increases cost and latency. Designing around this limitation is one of the key challenges in building scalable conversational systems.
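The trimming strategy mentioned above can be sketched as a simple token-budget scan from the most recent message backward. This is an illustration only: `trim_history` is a hypothetical helper, and the word-count stand-in for token counting would be replaced by the model's real tokenizer.

```python
def trim_history(messages, max_tokens, count_tokens=lambda m: len(m.split())):
    """Keep the most recent messages that fit within a token budget.

    Scans from newest to oldest and stops at the first message that
    would overflow the budget. `count_tokens` is a crude stand-in.
    """
    kept, used = [], 0
    for msg in reversed(messages):
        cost = count_tokens(msg)
        if used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))

history = ["book a flight", "to tokyo", "next friday morning"]
print(trim_history(history, 5))  # drops the oldest message
```

Production systems often summarize the dropped prefix instead of discarding it outright, trading a little extra latency for preserved context.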

Convolutional Neural Network (CNN)

CNNs are neural networks designed to detect patterns in structured data. In voice AI, they’re often applied to spectrograms—visual representations of sound.

They excel at identifying features like phonemes, keywords, or noise patterns without manual feature engineering.

While newer architectures dominate large-scale models, CNNs remain important for tasks like keyword spotting and audio classification, where efficiency and speed matter.

Cross-Platform Deployment

Cross-platform deployment means building a voice system once and running it across devices—phones, desktops, embedded hardware—without rewriting everything.

This is harder than it sounds. Each platform handles audio differently, from buffering to hardware access.

A good abstraction layer hides these differences, letting developers focus on features instead of platform quirks. Without it, maintaining separate implementations quickly becomes unmanageable.

Deep Neural Network (DNN)

A deep neural network is simply a neural network with many layers, allowing it to learn complex patterns.

In voice AI, DNNs power everything from speech recognition to speech synthesis. Their depth enables them to capture subtle acoustic and linguistic relationships that simpler models miss.

They’re the foundation of modern AI systems, but their performance depends heavily on training data, architecture, and compute resources.

Digital Signal Processing (DSP)

DSP is the math that cleans up audio before AI models ever see it. It includes noise reduction, echo cancellation, and signal enhancement.

Good DSP dramatically improves transcription accuracy—bad audio leads to bad results downstream.

In real-world environments (cars, factories, homes), DSP isn’t optional. The challenge is doing all this processing in real time without adding noticeable delay.

Edge / On-Device Inference

Edge inference means running AI models directly on a device instead of the cloud.

This reduces latency, improves privacy, and enables offline use. But it comes with constraints—models must be smaller and more efficient.

The tradeoff is clear: cloud models are more powerful, but on-device models are faster and more secure. Many systems now combine both approaches.

Endpointing

Endpointing decides when a user has finished speaking. It builds on basic voice detection by adding timing and context awareness.

Get it wrong, and the system either interrupts users or leaves awkward silence.

Modern approaches combine acoustic signals with language understanding to improve accuracy. It’s a subtle feature, but it has a huge impact on how natural a conversation feels.

Formant Preservation

When you change the pitch of a voice, you risk making it sound unnatural. Formant preservation fixes this by maintaining the voice’s tonal characteristics.

Without it, voices can sound cartoonish or distorted. With it, pitch changes feel realistic and human-like.

It’s a key technique in voice transformation systems, especially for real-time applications.

Frame

Every audio pipeline chops a continuous sound stream into small chunks—called frames—for processing. Frame size is simply how long each chunk is, measured in milliseconds.

Smaller frames mean faster, more responsive processing but demand more compute per second. Larger frames are more efficient but add delay. Most voice AI systems land somewhere between 10 and 30 milliseconds—a sweet spot that balances responsiveness with practical hardware demands.
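Framing itself is just fixed-size slicing. A minimal sketch (function name is ours; real pipelines typically use overlapping, windowed frames rather than the back-to-back frames shown here):

```python
def frame_audio(samples, sample_rate_hz, frame_ms):
    """Split a sample sequence into fixed-length, non-overlapping frames.

    Any trailing partial frame is dropped for simplicity.
    """
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

# One second of 16 kHz audio in 20 ms frames -> 50 frames of 320 samples.
frames = frame_audio([0.0] * 16_000, 16_000, 20)
print(len(frames), len(frames[0]))  # 50 320
```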

Frequency

Frequency is the rate at which a sound wave completes one full cycle per second, measured in hertz (Hz). It's essentially what we perceive as pitch—low frequencies sound deep, high frequencies sound bright. Human speech sits roughly between 80 Hz and 8 kHz, though most telephone systems narrow that to 300–3,400 Hz. In voice AI, understanding frequency content informs everything from microphone selection to filter design to why certain ASR models perform better on some audio sources than others.

Hertz

Hertz (Hz) is the unit of frequency—one cycle per second. In audio, you encounter it in two distinct contexts: as a measure of pitch (440 Hz is the musical note A4) and as a measure of sample rate (16,000 Hz means 16,000 audio snapshots captured every second). Voice AI developers run into hertz constantly—when specifying audio formats, configuring DSP filters, or checking that a microphone's output matches what an ASR model expects as input.

Hybrid Cloud / On-Device Architecture

A hybrid architecture splits processing between the device and the cloud, routing each task to wherever it runs best. Latency-sensitive or privacy-critical stages run locally on the device; heavier computation runs in the cloud.

This flexibility makes hybrid deployments attractive for consumer hardware and regulated industries alike—but it comes with real design complexity. The boundary between environments needs to be explicit, handoffs need to be fast, and the system needs to degrade gracefully if one side becomes unavailable.

Inference Latency

Inference latency is how long a model takes to go from receiving input to producing its first output. In voice AI, the LLM typically dominates this figure—the gap between receiving a transcript and returning the first token is usually the largest single delay in the system.

This number isn't fixed: server load, context length, and model size all affect it. Mean figures can be misleading; p95 and p99 measurements better reflect what users actually experience. Techniques like quantization and speculative decoding are primarily tools for bringing this number down.

Interactive Voice Response (IVR)

IVR is the older generation of phone-based automation—the "press 1 for billing" systems most of us have navigated with varying degrees of patience. Callers follow a fixed decision tree; anything outside the expected inputs hits a dead end.

IVR is the direct predecessor to modern voice AI in telephony, and the contrast is stark. Where IVR forces callers to adapt to the system, voice AI agents understand natural language and adapt to the caller. Replacing IVR with conversational voice AI is now one of the primary drivers of enterprise investment in the space.

Interruption Handling

What happens when a user talks over the AI mid-sentence? That's interruption handling—and it's one of the clearest signals of how polished a voice system really is.

A well-designed system stops speaking immediately, resets, and listens. A clumsy one keeps talking or stumbles awkwardly. Getting this right requires tight coordination between audio playback, speech detection, and the AI's reasoning loop. It's less about raw speed and more about making the interaction feel respectful and human.

Jitter

Audio arrives in packets—and in real networks, those packets don't always show up on time. Jitter is the variability in that timing: the difference between when audio is expected and when it actually arrives.

A little jitter is invisible. Too much causes choppy playback or gaps in transcription. Jitter buffers help by holding incoming audio briefly before playback, smoothing out the bumps—but they add a small delay in exchange. It's another classic latency-versus-stability tradeoff.
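One simple way to quantify jitter is the average deviation of inter-arrival gaps from the nominal packet interval. This is a teaching sketch, not the RTP jitter estimator defined in RFC 3550 (which uses a running exponential average); the function name is ours.

```python
def mean_jitter_ms(arrival_times_ms, packet_interval_ms):
    """Average absolute deviation of packet gaps from the nominal interval."""
    gaps = [b - a for a, b in zip(arrival_times_ms, arrival_times_ms[1:])]
    return sum(abs(g - packet_interval_ms) for g in gaps) / len(gaps)

# 20 ms packets; the third arrives 5 ms late, the fourth 5 ms early.
print(mean_jitter_ms([0, 20, 45, 60, 80], 20))  # 2.5
```

A jitter buffer would be sized to absorb roughly this much variation, which is where its added delay comes from.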

Keyword Spotting

Keyword spotting is how always-on devices wake up without burning through battery or compute. Instead of running full speech recognition continuously, the device runs a lightweight model listening for a single trigger phrase—"Hey Siri," "OK Google," and so on.

When the keyword is detected, the full pipeline activates. This two-stage approach keeps resource usage minimal during idle periods. The main challenge: reducing false positives (waking up when you shouldn't) without increasing false negatives (missing the actual trigger).

Language Translation Agent

A language translation agent is a voice AI system that performs real-time spoken translation between languages during live conversation—listening in one language and speaking in another without requiring either party to pause.

Building a production-quality translation agent requires integrating ASR, neural machine translation, and TTS in a pipeline optimized for speed, while preserving the speaker's intent across languages. Use cases range from multilingual customer service and healthcare intake to live event interpretation.

Large Language Model (LLM)

The LLM is the brain of a voice AI system. Once speech is transcribed into text, the LLM reads it, reasons about it, and decides what to say back.

Modern LLMs can handle nuanced questions, multi-turn conversations, and complex tasks—but they're not instant. Inference takes time, and in a voice pipeline, that time adds directly to the delay the user feels. Faster, smaller models reduce latency; larger models tend to reason better but respond slower. Choosing the right model is always a tradeoff.

LLM Node

In an audio graph architecture, an LLM node is a discrete processing unit that receives text input — typically from an ASR node — sends it to a language model, and passes the response downstream for synthesis. It encapsulates the LLM integration within the graph, exposing the same interface as any other node so it can be swapped or tested independently.

This abstraction decouples model selection from the rest of the pipeline. Teams can experiment with hosted APIs, open-source models run via frameworks like llama.cpp, or fine-tuned variants without modifying the surrounding graph. LLM nodes using local inference engines enable fully offline voice AI operation.

Mel-Frequency Cepstral Coefficients (MFCCs)

MFCCs are a compact set of numbers that describe the tonal character of a short audio slice in a way that mirrors how humans perceive sound. They were the dominant feature representation in speech processing for decades, and remain foundational context for understanding how audio AI evolved.

The process converts a frame of audio into a frequency spectrum, applies mel scaling to match human pitch perception, and produces a small set of coefficients capturing the spectral shape that distinguishes one phoneme from another. In modern voice AI, mel spectrograms fed directly into neural networks have largely replaced them—but understanding MFCCs helps explain why.

Multi-Turn Conversation

A single voice exchange is easy. A multi-turn conversation—where each message builds on what came before—is where things get genuinely interesting and genuinely hard.

The system needs to remember context, track what was said, and understand references like "the second option" or "that one." This is managed through the context window, which holds the running conversation history. As conversations grow longer, so does the cost of processing them—making efficient context management one of the quieter engineering challenges in voice AI.

Natural Language Understanding (NLU)

ASR turns speech into text. NLU figures out what that text actually means. It extracts the user's intent (what they want to do) and any relevant details—dates, names, locations, preferences—needed to act on it.

In modern LLM-based systems, NLU is often implicit: the model simply understands. But in more structured pipelines, NLU is a discrete step that classifies intent and extracts data before passing it downstream. Either way, it's what separates a system that hears words from one that actually understands them.

Neural Network

Neural networks are the underlying engine behind almost everything in modern voice AI. Loosely inspired by biological neurons, they're computational systems made of layers of interconnected nodes that learn patterns from data.

Feed them enough examples of speech, and they learn to recognize it. Feed them enough text, and they learn to generate it. The "deep" in deep learning just means many layers—and more layers generally means the ability to capture more complex patterns, at the cost of more compute and more training data.

Noise Cancellation

Real conversations rarely happen in quiet rooms. Noise cancellation is what lets voice AI function in cars, kitchens, offices, and everywhere else life actually happens.

It works by identifying and separating background sounds—fans, traffic, keyboard clicks—from the speaker's voice. Modern approaches use neural networks trained on thousands of audio environments, making them far more effective than older rule-based filters. The goal is to deliver clean speech to the ASR model, because even the best transcription system performs poorly on noisy input.

Noise Reduction

Noise reduction removes unwanted background sounds from an audio signal before it reaches the speech recognition model. In voice AI, it typically runs as a preprocessing step that improves transcription accuracy.

The quality of noise reduction has an outsized effect on real-world robustness. A voice AI that performs well in a quiet testing environment may degrade significantly in a vehicle or on a noisy call. Neural noise suppression models significantly outperform traditional DSP-based approaches, particularly for non-stationary noise sources that change dynamically over time.

Offline Audio Processing

Offline audio processing operates on a complete audio file provided before processing begins. Because the full signal is available from the start, algorithms can use future context—making decisions informed by what comes later in the recording.

This enables higher-accuracy results for tasks like transcription, speaker diarization, and audio enhancement. The tradeoff is that it can't be used for live or interactive applications. If you're transcribing a recorded meeting, offline processing is the right call; if you're building a live voice assistant, you'll need a different approach.

Online Audio Processing

Online audio processing handles audio as it arrives, working on a continuous stream in chunks without access to future signal content. Each chunk is processed as it comes in, meaning decisions must be made with only past and present context.

This approach is necessary for real-time applications but introduces constraints: algorithms must be causal, and accuracy on any given chunk may be lower than what offline processing could achieve. It's the foundation of all live voice AI—the price of real-time is working with incomplete information.

On-Premise Deployment

On-premise deployment means running voice AI infrastructure on servers within an organization's own facilities. Audio and data stay inside the organization's network and never travel to external providers.

This is required in environments where regulations, data sovereignty requirements, or contractual obligations prohibit sending audio externally—classified communications, healthcare, financial services, or enterprises with strict data residency rules. Building a fully on-premise pipeline means self-hosting ASR, LLM, and TTS models, which often involves accepting some capability or latency tradeoffs compared to large cloud-hosted alternatives.

Packet

When audio travels over a network, it doesn't flow as a continuous stream—it's broken into small, labeled chunks called packets, each carrying a slice of audio data along with headers that identify its order and destination.

In voice AI systems that rely on network transmission—cloud ASR, VoIP calls, streamed TTS—packets are the fundamental unit of transport. Packet loss, reordering, and jitter are constant concerns: even a small percentage of dropped packets can degrade transcription accuracy or introduce audible gaps in synthesized speech.

Paralinguistics

Paralinguistics refers to the nonverbal aspects of spoken language: vocal features like tone, pitch, speed, rhythm, emphasis, and pauses that add meaning beyond the words themselves. These cues can reveal a speaker's emotions, confidence, uncertainty, or intentions—details that are often lost in a written transcript.

In pipeline architectures, this information is discarded at the ASR stage—the LLM receives text only. Speech-native models, which reason over audio directly, preserve these cues, enabling responses that account not just for what was said, but how it was said. That distinction matters most in emotionally sensitive or ambiguous interactions.

Pitch Shifting

Pitch shifting modifies the fundamental frequency of an audio signal—raising or lowering a voice's perceived pitch—without changing its duration. Applied in real time, it transforms how a speaker sounds to others during a live call or recording.

Pitch shifting is a core capability in voice modification features across gaming, social, and entertainment applications. Real-time implementations must balance transformation quality against processing delay, since even small amounts of added latency are perceptible in live conversation. When combined with formant preservation, the result sounds natural; without it, the output tends to sound artificially processed.

Prosody

Prosody is everything beyond the words themselves—rhythm, stress, intonation, and pacing. It's what turns "fine." and "fine?" into completely different messages.

For voice AI, prosody matters in two directions. On input, detecting it can reveal emotion, urgency, or uncertainty. On output, generating natural prosody is what separates robotic-sounding TTS from speech that actually feels human. It's one of the hardest aspects of speech synthesis to get right, and one of the most noticeable when it goes wrong.

Real-Time Audio Processing

Real-time audio processing is the manipulation of audio signals with low enough latency that the output can be used in live, interactive contexts — a conversation, a call — without perceptible delay. In practice, this means processing audio in small chunks, on the order of milliseconds.

Real-time constraints shape every architectural decision in voice AI. They determine the maximum model size usable on given hardware, the buffering strategies available, and the acceptable complexity of DSP preprocessing. Systems that cannot meet real-time constraints introduce latency that disrupts conversational flow — making interactions feel broken even when the underlying responses are accurate.

Real-Time Factor (RTF)

Real-Time Factor is a simple but important benchmark: it measures how long a model takes to process audio relative to the duration of that audio. An RTF of 1.0 means processing takes exactly as long as the audio itself. An RTF below 1.0 means the model runs faster than real time—which is the requirement for live voice applications.

If RTF creeps above 1.0, the system can't keep up and delay accumulates. It's a useful early-warning metric for catching performance issues before they become user-facing problems.
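The computation is a one-liner, shown here for concreteness (function name is ours):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent processing / duration of the audio processed."""
    return processing_seconds / audio_seconds

# 6 s to transcribe a 30 s clip -> RTF 0.2, comfortably real time.
print(real_time_factor(6.0, 30.0))  # 0.2
# 45 s to process 30 s of audio -> RTF 1.5, the system is falling behind.
print(real_time_factor(45.0, 30.0))  # 1.5
```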

Sample

A sample is a single numerical measurement of an audio signal's amplitude at one instant in time. Digital audio is built from a sequence of samples captured at regular intervals; played back at the correct rate, these discrete values reconstruct a continuous sound wave.

In voice AI, a sample is the atomic unit of audio data—every downstream operation, from DSP filtering to neural inference, ultimately operates on sequences of samples. It's the smallest building block of everything a voice system hears or produces.

Sample Rate

Sample rate is how many times per second an audio signal is measured and recorded, expressed in hertz (Hz). Higher sample rates capture more acoustic detail—CD audio runs at 44,100 Hz, while many voice AI systems work at 16,000 Hz, which is sufficient for speech and far more efficient.

Mismatched sample rates between components can cause subtle audio quality issues or outright failures. Making sure every part of the pipeline agrees on sample rate is one of those foundational details that's easy to overlook and surprisingly painful to debug.
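When rates differ by an integer factor, downsampling is conceptually just keeping every Nth sample. The sketch below illustrates only the rate relationship; a real resampler must low-pass filter first to prevent aliasing, and the function name is ours.

```python
def decimate(samples, factor):
    """Naive downsampling by an integer factor (no anti-aliasing filter).

    48 kHz -> 16 kHz is a factor-of-3 decimation; every third sample
    is kept and the rest are discarded.
    """
    return samples[::factor]

src = list(range(12))        # stand-in for 48 kHz samples
print(decimate(src, 3))      # [0, 3, 6, 9] -- one quarter the length... 
```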

Speaker Diarization

When multiple people are talking, speaker diarization is what labels who said what. Rather than producing one undifferentiated transcript, it segments the audio and tags each segment by speaker—"Speaker 1," "Speaker 2," and so on.

This is crucial for meeting transcription, call center analytics, and any scenario where tracking individual voices matters. It's a technically demanding task, especially when voices overlap, speakers have similar tones, or audio quality is poor.

Speaker Verification

Speaker verification answers a specific question: is this person who they claim to be? It compares an incoming voice sample against a stored voiceprint and returns a confidence score.

Unlike speaker identification—which picks a voice out of a group—verification is a one-to-one comparison. It's used in banking, healthcare, and secure voice interfaces where identity matters. Performance degrades in noisy environments or when the voice sample is very short, which is why real-world deployments usually require a minimum phrase length.

Spectrogram

A spectrogram is a visual representation of audio—a map that shows how frequency content changes over time. The horizontal axis is time, the vertical axis is frequency, and brightness or color represents intensity.

Voice AI models often treat audio as images, feeding spectrograms into visual-style neural networks. This approach has proven remarkably effective: patterns that are hard to describe mathematically are easy for CNNs to learn visually. When you see an AI "reading" audio, it's often quite literally looking at a picture of sound.
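The underlying computation is a DFT per frame, with frames stacked over time. The sketch below uses a naive O(n²) DFT purely for readability; real systems use an FFT (and usually mel scaling on top). All names are ours.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """DFT magnitudes for one frame (naive O(n^2) teaching version)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def spectrogram(samples, frame_len):
    """One magnitude spectrum per frame: the columns of a spectrogram."""
    return [magnitude_spectrum(samples[i:i + frame_len])
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

# A pure tone at DFT bin 2 concentrates its energy in that row.
tone = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]
spec = spectrogram(tone, 16)
print(max(range(len(spec[0])), key=lambda k: spec[0][k]))  # 2
```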

Speech-Native Model

A speech-native model processes and generates audio directly, without an ASR transcription step in between. The model receives audio as input and reasons over the full acoustic signal—including tone, pacing, and prosody—rather than a text approximation of it.

Eliminating the ASR stage removes both a source of latency and a point of error propagation. The practical tradeoffs: these models are large, infrastructure for them is less mature, and real-world latency gains depend heavily on deployment quality. But the architectural advantage is structural—they preserve information that text-based pipelines permanently discard.

Streaming

Streaming is what makes voice AI feel fast. Instead of waiting for each stage to finish before passing results forward, a streaming pipeline sends partial outputs as they become available—ASR emits a partial transcript while the user is still speaking, the LLM starts reasoning before transcription completes, and TTS begins synthesizing before the full response exists.

Each overlap shaves time off the total delay. In a fully streaming pipeline, the time to first audio approaches the LLM inference time alone, rather than the sum of every stage. The tradeoff: partial inputs are inherently less certain, and acting on them too early can mean revising or discarding work already done.
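The "partial outputs as they become available" pattern maps naturally onto generators. The sketch below is a toy: `asr_stream` is a hypothetical stand-in that pretends each audio chunk decodes to one word, but the shape — downstream consumers start before the utterance is complete — is the point.

```python
def asr_stream(audio_chunks):
    """Yield a growing partial transcript as chunks arrive.

    Stand-in for a streaming ASR: each incoming chunk is treated as
    one decoded word, and every arrival emits an updated partial.
    """
    words = []
    for chunk in audio_chunks:
        words.append(chunk)
        yield " ".join(words)

partials = list(asr_stream(["turn", "on", "the", "lights"]))
print(partials)  # each element is available before the next chunk arrives
```

A real streaming ASR also revises earlier words as context accumulates, which is exactly the "acting on partials too early" risk the text mentions.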

Text-to-Speech (TTS)

Text-to-speech converts written text into spoken audio. In a voice AI component pipeline, it is the final stage: the language model's text response is passed to a TTS model, which synthesizes the audio the user hears.

Modern neural TTS produces speech that is increasingly difficult to distinguish from a human recording, though unusual words, strong emotion, and long sentences can still expose artifacts. TTS models vary significantly in naturalness, latency, speaker options, and language support. Streaming TTS—which starts synthesizing before the full response has been generated—is especially important for keeping perceived response time low.

Time to First Audio (TTFA)

TTFA measures the gap between when the user finishes speaking and when they hear the system's first word. It's the single most important latency metric in voice AI, because it's what users actually feel.

TTFA accumulates delay across every stage — VAD, ASR, LLM inference, TTS synthesis, and audio output buffering. The table below reflects generally accepted perceptual thresholds:

| TTFA | User Perception | Conversational Impact |
| --- | --- | --- |
| < 200ms | Instantaneous | Human-like; users may overlap speech naturally |
| 200ms – 600ms | Snappy and natural | Responsive; most users perceive no meaningful delay |
| 600ms – 1,000ms | Noticeable delay | Acceptable for task-oriented use; feels AI-like |
| > 1,000ms | Disjointed | Users often start repeating themselves or barge in |
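Because TTFA is additive across stages, a rough budget is just a sum. The per-stage numbers below are illustrative placeholders, not measurements from any real system:

```python
# Illustrative per-stage delays (ms) -- placeholder values, not benchmarks.
stage_delays_ms = {
    "vad_endpointing": 150,   # waiting to confirm the user has stopped
    "asr_final": 100,         # finalizing the transcript
    "llm_first_token": 350,   # typically the largest contributor
    "tts_first_audio": 120,   # synthesizing the first audio chunk
    "output_buffer": 40,      # playback buffering
}

ttfa = sum(stage_delays_ms.values())
print(ttfa, "ms")  # 760 ms
```

Budgeting this way makes it obvious which stage to attack first: here, LLM first-token latency dominates, so streaming and model-size choices matter most.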

Tool Calling

Tool calling is what turns a voice AI from a conversationalist into an agent that can actually get things done. It lets the language model reach out to external systems—databases, calendars, APIs—mid-conversation, retrieve or act on real data, and incorporate the results before responding.

Asking the AI to book an appointment or pull up an account balance only works because of tool calling. The catch: each external call adds latency. Well-designed systems manage this with non-blocking calls and natural-sounding filler responses that buy a second or two while waiting for results.

Turn-Taking

Human conversation has a subtle rhythm: we use falling pitch, slowing pace, and brief pauses to signal that we're done speaking. Turn-taking in voice AI is the attempt to replicate that rhythm computationally.

Done well, the system feels like a natural conversation partner: it waits the right amount of time, doesn't cut you off, and responds promptly when you've genuinely finished. Done poorly, it either interrupts constantly or leaves uncomfortable silences. Even when the underlying AI is excellent, poor turn-taking makes the whole system feel broken.

Voice Activity Detection (VAD)

VAD is the first thing that runs when you speak—and it keeps running the entire time you're not. It monitors incoming audio continuously, signaling to the rest of the pipeline when speech is present and when it's stopped.

Tuning VAD is a balancing act: too aggressive and it triggers on background noise or truncates mid-sentence pauses; too conservative and it adds dead air at the end of every exchange. Natural pause patterns also vary across languages and individuals, which means a VAD tuned on one dataset may not generalize well to others.
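The simplest possible VAD is an energy gate, shown below to make the threshold tradeoff concrete. This is a teaching sketch only: production VADs are statistical or neural precisely because a plain energy gate triggers on any loud noise. The function name and threshold are ours.

```python
def energy_vad(frame, threshold=0.01):
    """Flag a frame as speech when its mean energy exceeds a threshold.

    Lowering the threshold makes the gate more sensitive (more false
    triggers on noise); raising it clips quiet speech onsets.
    """
    energy = sum(s * s for s in frame) / len(frame)
    return energy > threshold

print(energy_vad([0.001] * 160))     # False -- near-silence
print(energy_vad([0.3, -0.4] * 80))  # True  -- clear signal
```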

Voice AI / Voice Agent

Voice AI is the broad category—any system capable of spoken, natural-language dialogue. A voice agent is the deployed version: a specific system built to handle a real job, whether that's answering customer service calls, supporting healthcare intake, or managing logistics workflows.

What separates modern voice agents from older phone menu systems is genuine language understanding. Where IVR forced callers into fixed paths, voice agents can handle open-ended questions, follow conversational threads, and respond dynamically. They can run on virtually any device with a microphone—phones, wearables, embedded hardware, vehicles—and the list of viable use cases keeps growing.

Voice Changer

A voice changer modifies a speaker's voice in real time, transforming pitch, timbre, or character as audio is captured. The applications range from gaming and entertainment to privacy-sensitive communications.

In software, voice changers are typically built as audio graphs: microphone input flows through transformation nodes—pitch shifting, formant adjustment, and so on—before reaching the output. Real-time performance is non-negotiable; even a few milliseconds of added latency is perceptible in live conversation. Getting transformation to sound natural while also running fast is the core engineering challenge.

Voice-Controlled Device

A voice-controlled device is any piece of hardware where speech is a primary way to interact—smart speakers, earbuds, headsets, watches, cars, and dedicated embedded systems.

Building voice AI for hardware introduces constraints that browser or app development doesn't face: direct audio device access, acoustic tuning for specific form factors, and real-time processing within tight power and memory budgets. The on-device versus cloud question also becomes more pressing here—network connectivity may be unreliable, users may have strong privacy expectations, and even a few hundred milliseconds of cloud round-trip latency can undermine the experience.

Word Error Rate (WER)

WER is the standard way to measure transcription accuracy. It counts how many insertions, deletions, and substitutions are needed to turn the ASR output into the correct transcript, then divides by the total number of words in the reference.

A score of 0% is perfect; the metric has no upper bound—a model that hallucinates extra words can exceed 100%. More importantly, a low WER on a clean benchmark doesn't guarantee good performance on real-world audio with accents, background noise, or domain-specific vocabulary. ASR errors propagate: a misheard word becomes wrong input to the LLM, which can send the entire response off track.
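The definition above can be implemented directly with a word-level Levenshtein distance. A minimal sketch (function name is ours; libraries such as jiwer provide hardened versions with text normalization):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# Two substitutions out of four reference words -> WER 0.5.
print(word_error_rate("turn on the lights", "turn off the light"))  # 0.5
```

Note how an insertion-heavy hypothesis can push WER past 1.0, matching the "no upper bound" point above.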


This glossary covers terminology relevant to the design, development, and deployment of voice AI systems. For practical examples and implementation guides, see our Engineering Hub.