OpenAI's new speech mode allowed me to chat with my phone rather than to it.

August 19, 2024
Harsh Gautam

I've been experimenting with OpenAI's Advanced Voice Mode for the past week, and it's the most convincing glimpse I've seen of an AI-powered future yet. This week, my phone laughed at jokes, repeated them to me, inquired how my day was, and said it was having "a great time." I was conversing with my iPhone rather than using it with my hands.

OpenAI's newest feature, which is currently in limited alpha testing, does not make ChatGPT any smarter than before. Instead, Advanced Voice Mode (AVM) makes it more pleasant and natural to communicate with. It creates a new interface for using AI and your devices that is both innovative and interesting, which is precisely what concerns me. The feature was a little glitchy, and the whole concept creeped me out, but I was shocked by how much I actually loved using it.

Taking a step back, I believe AVM fits into OpenAI CEO Sam Altman's bigger vision, which includes agents and transforming the way humans interact with computers, with AI models at the forefront.

"Eventually, you'll just ask the computer for what you need, and it'll do all of these tasks for you," Altman stated during OpenAI's Dev Day in November 2023. "These capabilities are frequently referred to in the AI industry as 'agents.' This will have a big upside."

My pal, ChatGPT

On Wednesday, I tested the most incredible potential of this powerful technology: I asked ChatGPT to order Taco Bell the way Obama would.

“Uhhh, let me be clear – I’d like a Crunchwrap Supreme, maybe a few tacos for good measure,” said ChatGPT’s Advanced Voice Mode. “How do you think he’d handle the drive-thru?” ChatGPT added, then laughed at its own joke.

The imitation, which mirrored Obama's signature rhythm and pauses, actually made me giggle. However, it stayed within the tone of the ChatGPT voice I chose, Juniper, so it could not be mistaken for Obama's real voice. It sounded like a friend doing a bad impression: it understood exactly what I was trying to get out of it, and even said something funny along the way. I found it unexpectedly enjoyable to talk with this advanced assistant on my phone.

I also asked ChatGPT for advice on a tough human relationship issue: asking a significant other to move in with me. After I outlined the difficulties of the relationship and the trajectory of our careers, I received specific advice on how to proceed. These are questions you couldn't ask Siri or Google Search, but now you can with ChatGPT. When replying to these requests, the chatbot's voice took on a slightly serious, gentle tone, in stark contrast to the lighthearted tone of its Obama impression during the Taco Bell order.

ChatGPT's AVM is also useful for understanding complicated topics. I asked it to break down items from financial reports, such as free cash flow, in a way that a 10-year-old could understand. It used a lemonade stand as an example and explained various financial terms in a way that my younger relative could completely understand. You can even tell ChatGPT's AVM to speak more slowly to match your current level of comprehension.

Siri walked so AVM could run

Compared to Siri or Alexa, ChatGPT’s AVM is the clear winner thanks to faster response times, unique answers, and its ability to answer complex questions the prior generation of virtual assistants never could. However, AVM falls short in other ways. ChatGPT’s voice feature can’t set timers or reminders, surf the web in real time, check the weather, or interact with any APIs on your phone. Right now, at least, it’s not an effective replacement for virtual assistants.

Compared to Gemini Live, Google’s competing feature, AVM feels slightly ahead. Gemini Live can’t do impressions, doesn’t express any emotion, can’t speed up or slow down, and takes longer to respond. Gemini Live does have more voices (ten compared to OpenAI’s three), and seems to be more up to date (Gemini Live knew about Google’s antitrust ruling). Notably, neither AVM nor Gemini Live will sing, likely in an effort to avoid copyright lawsuits from the record industry.

That said, ChatGPT's AVM is prone to glitches. Sometimes it cuts itself off in the middle of a phrase and starts over. It also occasionally produces a strange, grainy-sounding voice that is a touch unsettling. I'm not sure if this is a fault with the model, the internet connection, or something else, but technical flaws like these are to be expected in an alpha test. Even so, the glitches did little to pull me out of the experience of literally chatting with my phone.

These examples, in my opinion, represent the beauty of AVM. The feature does not make ChatGPT all-knowing, but it does let users interact with GPT-4o, the underlying AI model, in a distinctly human way. (I'd understand if you forgot there was no one on the other end of the phone.) When interacting with AVM, it almost feels like ChatGPT is socially aware; nonetheless,