
📣 vbot — voice-first LLM interactions

Over the past few months I’ve been working on a Twilio-based project (internally dubbed wappler), mainly to explore what it actually takes to build reliable voice-driven applications under real-world constraints.

As part of that work, I decided to step back and investigate a related question:

How far can one get today with a phone-call-based AI using mostly off-the-shelf components?

That exploration resulted in vbot, also referred to as scarlett.

📲 Repo: https://github.com/mxli417/vbot

💡 Motivation

Voice interfaces are often presented as a solved problem. In practice, they remain surprisingly brittle once one moves beyond controlled demos.

Using Twilio’s ConversationRelay, I wanted to explore how far one can get with mostly off-the-shelf components: hosted speech-to-text and text-to-speech, a plain WebSocket backend, and an off-the-shelf LLM.

The goal was not to build a product, but to better understand the constraints and failure modes of voice-first interaction.
With the limited credit available in a trial account, I started building and experimenting.

📡 What ConversationRelay actually does

Twilio’s ConversationRelay abstracts away much of the low-level complexity involved in building voice-driven applications.

At a high level, it answers the incoming call, transcribes the caller’s speech to text, streams the transcript to your backend over a WebSocket, and converts your backend’s text replies back into speech for the caller.

All of this is configured via TwiML, Twilio’s XML-based markup language for controlling call behavior.
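
For illustration, a minimal voice webhook handing an incoming call over to ConversationRelay could look like the sketch below (assumptions on my part: a Flask app wired up as the Twilio number’s voice webhook, and a placeholder wss:// URL for the backend):

```python
# Minimal sketch: return TwiML that hands the call to ConversationRelay.
# Assumes Flask; the WebSocket URL and greeting are placeholders.
from flask import Flask, Response

app = Flask(__name__)

@app.route("/voice", methods=["POST"])
def voice():
    twiml = """<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <ConversationRelay url="wss://example.com/ws" welcomeGreeting="Hello!" />
  </Connect>
</Response>"""
    return Response(twiml, mimetype="text/xml")

if __name__ == "__main__":
    app.run(port=5000)
```

Twilio fetches this TwiML when the call arrives; everything afterwards happens over the WebSocket.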

Conceptually, the flow looks like this:


Caller
↓
Twilio Voice
↓
ConversationRelay
├── Speech-to-Text
├── WebSocket ↔ Your Backend
└── Text-to-Speech
↓
Caller

From the application’s perspective, this means you never have to deal with raw audio — only with text input and output.

This abstraction is powerful, but also constraining: it simplifies development significantly, while making latency, streaming behavior, and voice quality largely dependent on the platform.
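
To make the text-only boundary concrete, here is a rough sketch of the backend side of that WebSocket (assumptions: Python with the websockets library; the message shapes, a "prompt" message in and "text" tokens out, follow Twilio’s ConversationRelay documentation):

```python
# Sketch of a ConversationRelay backend: receives transcribed speech as JSON,
# replies with text for synthesis. Assumes the `websockets` package (>= 10.1).
import asyncio
import json

import websockets


async def handle(ws):
    async for raw in ws:
        msg = json.loads(raw)
        # Caller speech arrives as a "prompt" message with a voicePrompt field.
        if msg.get("type") == "prompt":
            reply = f"You said: {msg['voicePrompt']}"  # stand-in for a real agent
            # Text sent back is synthesized and spoken to the caller.
            await ws.send(json.dumps({"type": "text", "token": reply, "last": True}))


async def main():
    async with websockets.serve(handle, "0.0.0.0", 8080):
        await asyncio.Future()  # run until cancelled


if __name__ == "__main__":
    asyncio.run(main())
```

Note that the handler never touches audio: strings in, strings out.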

🤖 What vbot (scarlett) actually does

vbot connects these components in the simplest possible way: transcribed caller speech arrives over the ConversationRelay WebSocket, is passed to an LLM, and the model’s reply is sent back to be spoken to the caller.

It supports just enough to hold a basic multi-turn phone conversation.

The agent itself is intentionally minimal.
My focus was primarily on system behavior: testing Twilio’s capabilities, evaluating voice quality, and identifying the practical pitfalls of voice-based interaction.
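
To illustrate how thin such an agent can be, a single conversational turn reduces to something like the sketch below (assumptions: an OpenAI-style chat completions API stands in for the LLM; vbot’s actual model and provider are not specified here, and the model name is a placeholder):

```python
# Illustrative turn handler: the whole "agent" is append, ask, append.
# Assumes the openai package; the model name below is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
history = [{"role": "system", "content": "You are a concise phone assistant."}]


def agent_reply(user_text: str) -> str:
    """One turn: record the caller's words, ask the model, record the answer."""
    history.append({"role": "user", "content": user_text})
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=history)
    reply = resp.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply
```

Dropped into the WebSocket handler in place of the echo reply, this is already enough for a full phone conversation.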

📣 Voice choice and perceived intelligence

One of the most striking observations was how strongly voice choice influences perceived intelligence, likability, and usability.

In practice, the same underlying system can come across as noticeably smarter or dumber depending on nothing but the voice it speaks with.

Twilio ConversationRelay allows selecting voices from multiple providers, including Google, Amazon, and ElevenLabs.

This makes it possible to empirically evaluate how different voices affect interaction quality — something that is difficult to appreciate without running real calls.
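
Conveniently, swapping voices requires no backend changes: it comes down to attributes on the ConversationRelay TwiML element. The helper below sketches this (the ttsProvider and voice attribute names follow Twilio’s documentation; the example values are placeholders, not recommendations):

```python
# Illustrative: voice selection is a pair of TwiML attributes.
def conversation_relay_twiml(ws_url: str, tts_provider: str, voice: str) -> str:
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<Response><Connect>"
        f'<ConversationRelay url="{ws_url}" ttsProvider="{tts_provider}" voice="{voice}" />'
        "</Connect></Response>"
    )


# e.g. conversation_relay_twiml("wss://example.com/ws", "ElevenLabs", "some-voice-id")
```

Comparing candidate voices then amounts to redeploying the webhook with a different attribute pair and placing a real call.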

🤓 Observations

A few things became clear rather quickly. Latency, streaming behavior, and voice quality shape the perceived experience at least as much as the model’s answers do. In practice, voice AI turns out to be much more about systems engineering than about the AI models themselves.

🛣️ Where this might go

I currently see this project primarily as a research playground: not a product, but a way to better understand what voice-first LLM interaction can realistically support today.

💭 Closing thoughts

vbot (scarlett) is intentionally small and incomplete.

It exists to answer a simple question:

How usable can an LLM be when the keyboard is removed and everything happens through a phone call?

So far, the answer is: more usable than expected — but far from trivial.

If nothing else, it has been a useful way to explore the practical limits of current voice + LLM stacks.
