Voice AI Engineering

I built a phone number you can call and argue with an AI — here's the part nobody tells you

Voice AI Telephony LLMs Engineering

Audience: engineers and the people who hire them. ~10 min read.

I wanted one thing: dial a regular phone number and have an AI support agent pick up and actually help. Pull from a knowledge base, book an appointment, sound like a person. The text-chat version of this is a solved problem now. The phone version is where the interesting engineering hides, because a phone call is a real-time, full-duplex audio stream, the model in the middle is slow, and the transcription is noisy enough that you can't treat it as authoritative.

This is the story of building voice for TeaVoice, an AI customer-support platform. I'll show you the path I took, the wall I hit, and the four problems that don't exist in chat and absolutely do exist on a phone.

First, a quick glossary so the rest reads clean:

  • PSTN: the regular phone network. Actual calls, not app-to-app.
  • DID: the phone number people dial.
  • Telnyx: my telephony provider. It bridges the phone call to my server.
  • Webhook: Telnyx HTTP-POSTs my server when something happens on the call.
  • Media stream: Telnyx sends me the raw audio, live, instead of a transcript.
  • STT / TTS: speech→text and text→speech.
  • VAD: voice activity detection. Figuring out when the caller stopped talking.

Attempt 1: let the phone company do the hard part

The obvious first move is to let Telnyx handle speech. They have an API for it. The loop looks clean on a whiteboard:

flowchart TD
    A[Answer the call] --> B[Speak the greeting]
    B --> C[Wait for 'speak finished']
    C --> D[Start transcription]
    D --> E[Caller talks]
    E --> F[Transcription webhook arrives]
    F --> G[Stop transcription]
    G --> H[Run it through the AI]
    H --> I[Speak the reply]
    I --> C

Every box from "start transcription" to "transcription webhook" is Telnyx's to own. That's the part that bit me.

I built that. Then I spent a genuinely humbling number of hours discovering that the provider's transcription has trapdoors:

  • One transcription engine returns 200 OK and then sends zero transcription events. Forever. No error. It just silently does nothing.
  • The moment I added a config option to pick a better transcription model, the whole thing went quiet again. Same 200 OK, still no events.
  • Their built-in "AI assistant" feature can only be started once per call, so you can't use it to drive a turn-by-turn conversation with your own logic.
  • And the speech recognition keeps transcribing the agent's own voice as if the caller said it, so you have to choreograph exactly when you start and stop listening around when you're talking.

It wasn't that the provider was bad. It was that the more of the audio pipeline I handed off, the less I could control the two things that actually matter: latency and correctness. I was tuning a black box.

So I stopped asking the phone company to listen for me.


Attempt 2: take over the audio

The better path: have Telnyx fork the raw audio of the call to my server over a WebSocket, and run my own everything. Now the flow is:

flowchart TB
    Caller([📞 Caller on PSTN]) <-->|phone audio| Telnyx[Telnyx Call Control]

    subgraph CP["Control plane: HTTP webhooks"]
        direction LR
        W1["call.initiated<br/>route number, create<br/>record, answer"]
        W2["call.answered<br/>start media stream"]
        W3["call.hangup<br/>clean up, finalize"]
    end

    subgraph DP["Data plane: media WebSocket"]
        direction TB
        VAD["VAD: detect end of turn<br/>(loudness + ~1.5s silence)"]
        STT["Speech-to-Text (Whisper)"]
        AI["AI pipeline<br/>guardrails → search → LLM<br/>(same brain as web chat)"]
        TTS["Text-to-Speech"]
        VAD --> STT --> AI --> TTS
    end

    Telnyx -->|HTTP events| CP
    Telnyx <-->|raw L16 audio| VAD
    TTS -.->|synthesized audio back| Telnyx

There are two clean halves here. The control plane is still webhooks, but tiny now. Just three events:

  • call starts → look up which business and which agent this number belongs to, create a call record, answer.
  • call answered → tell Telnyx "stream the audio to this WebSocket."
  • call hangs up → clean up timers, finalize the record.

The data plane is the audio WebSocket, and that's where everything interesting lives.

The big win: the same AI brain that powers web chat now powers the phone. The transcript runs through the identical pipeline of content guardrails, knowledge-base search, the LLM, and output checks. I just wrap it with voice-specific instructions. One brain, two mouths.

That's the architecture. Now the four problems that only exist on a phone.


Problem 1: "Are they done talking?"

In chat, the user presses Enter. That's the turn boundary, handed to you for free. On a phone there's no Enter. You get a relentless stream of audio chunks and you have to decide when the caller has finished a thought.

I do the cheap, boring thing that works: measure how loud each chunk is (RMS amplitude), and call it "end of turn" after about 1.5 seconds of silence following speech. Buffers shorter than ~100ms get thrown away as noise. No ML, no fancy endpointing model. Just a loudness threshold and a silence counter.

It's not glamorous, and it occasionally clips someone who pauses mid-sentence to think. But it's predictable, it adds zero latency, and "predictable" beats "clever" when you're debugging a live phone call.


Problem 2: the AI keeps interviewing itself

Here's a bug that doesn't exist anywhere else. Because the audio stream is bidirectional (my TTS audio goes back out the same pipe the caller's audio comes in), the agent hears its own voice, transcribes it, and treats it as the caller talking. The AI ends up in a conversation with itself. It's funny for about ten seconds.

Two guards fix it:

  1. While I'm playing audio to the caller, I drop every incoming chunk on the floor. The agent is deaf while it's speaking.
  2. For a full second after I finish speaking, I keep ignoring incoming audio, because there's a tail of echo and network delay where my own voice is still arriving.

Crude? Yes, and it's a real tradeoff: going deaf while I talk means the caller can't interrupt me, which is closer to a walkie-talkie than a natural conversation. But echo cancellation is a rabbit hole, and "go deaf while you talk, plus a one-second cooldown" eliminated the self-conversation completely. It was a debugging-first choice, not ideal conversational UX.


Problem 3: the transcription is just... wrong a lot

Phone audio is 8–16kHz of compressed, noisy garbage compared to a podcast mic. Whisper does its best, but you get transcripts like "I wanna book a point mint for toose day." If you treat that as gospel and the AI replies "I'm sorry, I didn't understand" every third turn, the call is unusable.

The fix wasn't a better STT model. It was telling the AI to expect garbage and guess anyway. Before each turn I inject instructions that say, in effect:

This text came from speech recognition and may be wrong. Figure out what the caller probably meant and help them. Don't say "could you repeat that" over and over. If it's truly unintelligible, ask one specific clarifying question. Keep your answer to 1–2 sentences, because it's going to be read out loud.

"Book a point mint for toose day" becomes "Sure, I can book an appointment for Tuesday. What time works?" The model is a fantastic error-correcting decoder if you give it permission to be one. That instruction prefix did more for call quality than anything I changed in the audio layer.

Two details that mattered. I pass those instructions as a separate system message, not glued onto the transcript, because otherwise the model occasionally repeats them back as if the caller said them. And I cap replies at 1–2 sentences, because nobody wants an AI reading a five-paragraph essay at them over the phone.


Problem 4: one thing at a time

Audio chunks arrive continuously and I process turns as async tasks, so it's entirely possible for two turns to start overlapping: two TTS clips playing at once, two "am I speaking?" flags fighting each other. I wrap the AI-plus-speak-plus-playback part of each turn in a lock so exactly one turn runs at a time. Simple, and it kills a whole category of race conditions.


The whole turn, as one state machine

Those four problems aren't separate features. They're a single loop. Here's the life of one conversational turn, including the deaf-while-speaking and cooldown states that keep the bot from hearing itself:

stateDiagram-v2
    [*] --> Listening
    Listening --> Capturing: caller speaks, loud enough
    Capturing --> Listening: too short, discard as noise
    Capturing --> Processing: ~1.5s of silence
    Processing --> Speaking: STT then AI then TTS
    Processing --> [*]: caller said goodbye
    Speaking --> Cooldown: playback finished
    Cooldown --> Listening: 1s echo guard elapsed

    note right of Speaking
        Deaf while speaking:
        every inbound chunk dropped
    end note
    note right of Cooldown
        Still deaf for 1s:
        tail echo is still arriving
    end note

How the pros do this

Before you conclude I invented something weird in a basement: I didn't. The cascade I just described (telephony → audio stream → speech-to-text → LLM → text-to-speech → back) is the standard voice-agent architecture. It's what Pipecat AI and LiveKit Agents are frameworks for, and what platforms like Vapi, Retell, Deepgram's Voice Agent API, and ElevenLabs' Conversational AI all run under the hood. Pipecat in particular follows the same shape as what's in this post: transport → VAD → STT → LLM → TTS → transport, frame by frame. I hand-rolled a mini-Pipecat, emphasis on mini. The frameworks do the hard parts properly (interruption handling, streaming orchestration, partial-transcript routing) where I cut corners. If I were starting today and didn't want to learn these lessons the hard way, I'd reach for one of those frameworks first.

Where the serious systems pull ahead is that they replace each of my deliberately crude mechanisms with a purpose-built model. My "wait for 1.5 seconds of silence" turn detection becomes a semantic turn-taking model (Deepgram's endpointing, ElevenLabs' dedicated turn-taking model, LiveKit's turn detector) that knows the difference between "I'm done" and "I'm thinking." My "go deaf while I'm speaking" echo guard becomes real acoustic echo cancellation plus true barge-in that cuts the bot off mid-sentence the instant you interrupt. And my batch "synthesize the whole reply, then play it" becomes streaming TTS, fed token-by-token straight from the LLM so the caller hears the first words while the rest is still generating:

PieceMy crude versionThe production version
Turn detectionLoudness + 1.5s silenceSemantic turn-taking / endpointing model
Echo / interruptionGo deaf while speakingAcoustic echo cancellation + real barge-in
Text-to-speechBatch, then playStreaming, fed from LLM tokens
Speech-to-textBuffer a turn, transcribe onceContinuous streaming with partial results

There's also a second paradigm worth knowing about, because it changes the whole picture. Everything above is a cascade: three separate models in a row, flexible and debuggable but paying a latency tax at every hop. The frontier (Google's Gemini Live, OpenAI's Realtime API) is moving to speech-to-speech. From the developer's perspective it's one model that takes audio in and emits audio out, with no separate transcription or synthesis step to wire up. It's lower latency and far better at tone, laughter, and interruptions. But it's a black box you can't inspect or swap pieces of, which is the exact problem that made me abandon "let the phone company do it" in the first place. Google is the tell here. Their contact-center product is a cascade like mine, while their frontier product is speech-to-speech: same company, two answers, because the right one depends on whether you value control or latency more.

flowchart LR
    subgraph Cascade["Cascade: what I built (and Pipecat, Vapi, Deepgram...)"]
        direction LR
        a1([audio in]) --> a2[STT] --> a3[LLM] --> a4[TTS] --> a5([audio out])
    end
    subgraph S2S["Speech-to-speech: the frontier (Gemini Live, OpenAI Realtime)"]
        direction LR
        b1([audio in]) --> b2[one model] --> b3([audio out])
    end
    Cascade ~~~ S2S

Three boxes, three latency hops, three things you can swap and debug. Versus one box that's faster and more natural but that you can't open up.

So here's the honest placement of this project. The core cascade architecture is industry-standard, the mechanisms are the simple-but-debuggable versions of what the specialists productize, and the next rung up the ladder is either swapping in better models for each stage or collapsing the whole cascade into a realtime speech-to-speech model.


What it costs, and what I'd do next

Every turn logs its budget: STT time / AI time / TTS time. That single log line is the most useful thing I added, because on a phone call latency is the product. A 4-second silence after someone asks a question feels broken even if the answer is perfect. In my setup, the LLM call dominated that budget, which points at the obvious next moves: stream the TTS so the caller hears the first words while the rest is still generating, and start synthesizing speech from the model's tokens as they arrive instead of waiting for the full reply.

I'd also replace the loudness-based turn detection with a real endpointing model, and graduate the echo guard from "go deaf for a second" to actual acoustic echo cancellation. None of that was needed to ship something that works, though, and that's the point. The crude versions held up, and they were debuggable at 2am with a phone in one hand.


The takeaway

The interesting part of voice AI turned out not to be the AI. It's the seam between a real-time audio stream and a slow, fallible language model: knowing when someone's done talking, stopping the bot from hearing itself, making the model robust to a transcriber that's wrong a third of the time, and watching your latency budget like it's the only metric that matters. Get those right with embarrassingly simple mechanisms, and the LLM part, the part everyone thinks is hard, is genuinely the easy bit you already built for chat.