Working draft — Sancto AI is expanding this with anonymized data from our last three voice deployments.
The three viable paths
- Full-stack vendor (Retell, Vapi, Bland, Synthflow). They give you a phone number, a builder, and an LLM behind it. Live in days.
- Component vendors (Twilio + Deepgram + OpenAI Realtime). You orchestrate. More control. More code.
- Hybrid. Vendor for telephony + STT, custom for LLM logic + tool calls. Our default.
Cost curves (per minute, talk time)
| Path | Cost / min | Setup time |
|---|---|---|
| Retell / Vapi / Bland | $0.18–$0.32 | 1–5 days |
| Twilio + Deepgram + OpenAI Realtime (DIY) | $0.10–$0.18 | 3–6 weeks |
| Hybrid (Twilio + your LLM) | $0.12–$0.22 | 2–4 weeks |
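Crossover point: roughly 10,000 minutes/month. Below that, vendors win on total cost of ownership (TCO), because setup and maintenance effort outweighs the per-minute savings. Above that, building wins, sometimes dramatically: at 5,000 minutes/day (about 150,000 minutes/month), the per-minute rates above work out to roughly $27k–$48k/mo on a vendor vs $15k–$27k/mo DIY.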
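To sanity-check these numbers against your own volume, a minimal sketch like the one below can help. The per-minute ranges come straight from the table; the `Path` class, the function names, and the `extra_fixed_cost_per_month` parameter are our own illustrative assumptions, and the $1,100/month overhead in the demo is a made-up figure, not measured data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    name: str
    low_per_min: float   # USD per minute, low end of the table range
    high_per_min: float  # USD per minute, high end of the table range


# Per-minute ranges copied from the table above.
VENDOR = Path("Retell / Vapi / Bland", 0.18, 0.32)
DIY = Path("Twilio + Deepgram + OpenAI Realtime", 0.10, 0.18)
HYBRID = Path("Hybrid", 0.12, 0.22)


def monthly_usage_cost(path: Path, minutes_per_month: int) -> tuple[float, float]:
    """Return the (low, high) usage-only cost in USD for one month."""
    return (path.low_per_min * minutes_per_month,
            path.high_per_min * minutes_per_month)


def break_even_minutes(vendor_rate: float, build_rate: float,
                       extra_fixed_cost_per_month: float) -> float:
    """Monthly volume at which building becomes cheaper than the vendor.

    extra_fixed_cost_per_month is a hypothetical figure: the amortized
    engineering and maintenance cost that building adds over a vendor.
    """
    if vendor_rate <= build_rate:
        raise ValueError("building never breaks even if it costs more per minute")
    return extra_fixed_cost_per_month / (vendor_rate - build_rate)


if __name__ == "__main__":
    for path in (VENDOR, DIY, HYBRID):
        low, high = monthly_usage_cost(path, 10_000)
        print(f"{path.name}: ${low:,.0f} to ${high:,.0f} per month")
    # Midpoint rates ($0.25 vendor, $0.14 DIY) and an assumed $1,100/month overhead.
    print(round(break_even_minutes(0.25, 0.14, 1_100)))
```

The usage-only numbers ignore engineering time, which is why the break-even helper takes the fixed overhead as an explicit input. Plug in your own figure before trusting the result.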
Where vendors win
- Speed to first customer call
- Out of the box: barge-in (interruption handling) and voice variety
- No telephony expertise required
- SIP, transfers, IVR fallback — all handled
Where building wins
- Per-minute cost at volume
- Custom tool calls (CRM lookup mid-call, calendar booking with custom rules)
- Data residency (a vendor sends call audio to its own cloud, which your compliance rules may not allow)
- Multi-language support with consistent quality across every language
What kills voice projects before launch
- Latency. Any response delay over 800 ms feels broken. Test in production-like conditions, not localhost.
- Interruption handling. Humans interrupt. Your agent has to stop talking immediately and resume sensibly.
- Hallucinated bookings. The model confidently says "Tuesday at 3pm" when the calendar shows 4pm. Always read tool outputs back to the caller and ask for confirmation.
- The 5% accent failure. 95% accuracy on accents sounds great until you remember 5% of your customers can't use the product.
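The hallucinated-booking failure is the easiest of these to guard against in code. One possible approach, sketched below, is to build the read-back sentence from the tool's structured result rather than from whatever the model drafted. `BookingResult`, `spoken_time`, `confirmation_line`, and `model_matches_tool` are hypothetical names for illustration; your calendar tool will return its own shape.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class BookingResult:
    """Hypothetical shape of what a calendar tool returns after booking."""
    start: datetime
    confirmed: bool


def spoken_time(moment: datetime) -> str:
    """Format a datetime the way an agent would say it, e.g. 'Tuesday at 4pm'."""
    hour = moment.hour % 12 or 12
    suffix = "am" if moment.hour < 12 else "pm"
    minutes = f":{moment.minute:02d}" if moment.minute else ""
    # %A yields an English weekday name under the default C locale.
    return f"{moment.strftime('%A')} at {hour}{minutes}{suffix}"


def confirmation_line(result: BookingResult) -> str:
    """Build the read-back sentence from tool data, never from model text."""
    if not result.confirmed:
        return "I couldn't book that slot. Would you like to try another time?"
    return f"You're booked for {spoken_time(result.start)}. Is that right?"


def model_matches_tool(model_text: str, result: BookingResult) -> bool:
    """Check that the model's draft reply mentions the time the tool returned."""
    return spoken_time(result.start).lower() in model_text.lower()
```

If `model_matches_tool` returns `False`, drop the model's draft and speak `confirmation_line(result)` instead. The substring check is deliberately crude; it catches the "3pm vs 4pm" case, not every paraphrase.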
Our recommendation
Under 5k minutes/month, single language, simple flow: Retell or Vapi. Done in a week, move on.
5k–30k minutes/month, custom integrations needed: Hybrid. Telephony from vendor, brain from you.
30k+ minutes/month or strict data residency: Full DIY. It's a project, not a config — but the unit economics demand it.
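If you want these tiers in a planning script, a rough sketch follows. The thresholds are the ones above; `recommend_path` and its parameters are hypothetical names, and the handling of cases the tiers leave open (low volume with custom integrations, mid volume without them) is our assumption. Language count is left out for brevity.

```python
def recommend_path(minutes_per_month: int,
                   needs_custom_integrations: bool = False,
                   strict_data_residency: bool = False) -> str:
    """Map the three recommendation tiers onto a single function."""
    if minutes_per_month < 0:
        raise ValueError("minutes_per_month must be non-negative")
    # Strict data residency or 30k+ minutes/month: full DIY.
    if strict_data_residency or minutes_per_month >= 30_000:
        return "full DIY"
    # 5k-30k minutes/month: hybrid. Assumption: custom integrations also
    # push a lower-volume project to hybrid, since the tiers don't say.
    if minutes_per_month >= 5_000 or needs_custom_integrations:
        return "hybrid"
    # Under 5k minutes/month with a simple flow: full-stack vendor.
    return "full-stack vendor"


if __name__ == "__main__":
    print(recommend_path(2_000))
    print(recommend_path(12_000, needs_custom_integrations=True))
    print(recommend_path(50_000))
```

Treat the output as a starting point for the conversation, not a verdict. The boundaries are fuzzy in practice.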
Voice AI is the rare AI product where the LLM is the easy part. The other 80% — telephony, latency, interruptions, tool calling — is what eats your timeline.