
The International Telecommunication Union (ITU) recommends a mouth-to-ear latency of less than 400 milliseconds to maintain a natural conversation. “Mouth-to-ear” is the time between words leaving the speaker’s lips and reaching the listener’s ear. A human listener then typically takes a couple of hundred milliseconds to begin responding. To mimic human interaction, then, AI systems must produce a response within a tight time window. The AI’s response makes another trip back through the network before the original speaker hears it. All in all, the whole exchange needs to complete in around a second, or the conversation starts to feel off. Most voice AI systems today are on the cusp of hitting this target, and new technologies and better techniques keep closing the gap.
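To make the budget concrete, here is a rough back-of-the-envelope sketch in Python. The individual stage numbers are hypothetical examples for illustration, not measurements; only the roughly one-second total comes from the discussion above.

```python
# Illustrative round-trip latency budget for one voice AI turn.
# Stage values are hypothetical examples, not measured figures.

BUDGET_MS = 1000  # roughly one second before the exchange starts to feel off

stages_ms = {
    "caller speech -> AI (network + capture)": 150,
    "speech-to-text (ASR)": 200,
    "LLM time to first tokens": 300,
    "text-to-speech (TTS)": 150,
    "AI speech -> caller (network + playback)": 150,
}

total = sum(stages_ms.values())
print(f"total: {total} ms (budget: {BUDGET_MS} ms)")
for stage, ms in stages_ms.items():
    print(f"  {stage}: {ms} ms ({ms / total:.0%} of total)")
```

Note how little slack remains: with these example figures, every stage must overlap or shrink for the whole turn to stay under the budget.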
Latency can make or break real-time AI systems, and we’ve seen it compound with missing language support in health care. A startup based in Australia, for example, wanted to use an AI caller to check on elderly Cantonese-speaking patients, which would seem a good use of the technology. However, high latency to US-based voice AI infrastructure, combined with the lack of a Cantonese TTS voice, made the experience unnatural.
Addressing latency is largely an engineering exercise: you cut it wherever you can during development. That means building real-time flows end to end, streaming audio in and out concurrently rather than waiting for the LLM to produce its full text output before passing it to the TTS for synthesis.
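The idea of streaming concurrently rather than waiting for the full reply can be sketched with Python’s asyncio. This is a minimal illustration, not a real integration: `fake_llm_tokens` and `fake_tts` are hypothetical stand-ins for a streaming LLM API and a TTS engine, and the sentence-boundary flushing rule is a simplification.

```python
# Sketch: start TTS on each sentence as the LLM emits tokens,
# instead of waiting for the complete reply before synthesizing.
import asyncio

async def fake_llm_tokens():
    # Stand-in for a streaming LLM API that yields tokens as generated.
    for token in ["Hello", ", ", "how ", "are ", "you", "? ",
                  "I ", "can ", "help", "."]:
        await asyncio.sleep(0.01)  # simulated per-token generation delay
        yield token

async def fake_tts(sentence, out):
    # Stand-in for a TTS engine; "synthesizes" one sentence to audio.
    await asyncio.sleep(0.02)  # simulated synthesis time
    out.append(f"<audio:{sentence.strip()}>")

async def stream_reply():
    audio = []   # audio chunks, playable as soon as each one is ready
    buffer = ""
    tasks = []
    async for token in fake_llm_tokens():
        buffer += token
        # Flush to TTS at sentence boundaries, not at end of reply,
        # so synthesis overlaps with ongoing generation.
        if buffer.rstrip().endswith((".", "?", "!")):
            tasks.append(asyncio.create_task(fake_tts(buffer, audio)))
            buffer = ""
    if buffer.strip():  # flush any trailing partial sentence
        tasks.append(asyncio.create_task(fake_tts(buffer, audio)))
    await asyncio.gather(*tasks)
    return audio

audio = asyncio.run(stream_reply())
print(audio)
```

With this structure, the first audio chunk is ready shortly after the first sentence of text, while the LLM is still generating the rest, which is where most of the perceived latency savings come from.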

