Ultravox

Ultravox is a new kind of multimodal LLM that understands text as well as human speech, without the need for a separate Automatic Speech Recognition (ASR) stage. It extends open-weight LLMs with a multimodal projector that converts audio directly into the high-dimensional embedding space used by the LLM. Versions have been trained on Llama 3, Mistral, and Gemma. Because audio never passes through a separate transcription step, Ultravox responds faster than systems that chain distinct ASR and LLM components. Future versions will also natively understand the timing and emotion cues present in human speech.
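Conceptually, the projector maps audio-encoder frames into the LLM's token-embedding space so the audio can be consumed like ordinary tokens. The sketch below illustrates this idea only; the encoder dimensions, frame stacking, and layer sizes are illustrative assumptions, not Ultravox's exact architecture.

```python
# Illustrative sketch of an audio-to-LLM projector. Dimensions, downsampling,
# and layer choices are assumptions, not Ultravox's actual design.
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Maps audio-encoder frames into the LLM's embedding space."""

    def __init__(self, audio_dim: int = 1280, llm_dim: int = 4096, stack: int = 8):
        super().__init__()
        self.stack = stack  # stack adjacent frames to shorten the audio sequence
        self.proj = nn.Sequential(
            nn.Linear(audio_dim * stack, llm_dim),
            nn.SiLU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, frames, audio_dim), e.g. from a speech encoder
        b, t, d = audio_feats.shape
        t = t - t % self.stack                                   # drop trailing frames
        x = audio_feats[:, :t].reshape(b, t // self.stack, d * self.stack)
        return self.proj(x)                                      # (batch, t // stack, llm_dim)

# The projected "audio tokens" are concatenated with normal text-token embeddings
# and fed to the LLM, so no intermediate ASR transcript is required.
```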

Ultravox currently takes in audio and emits streaming text. As the model evolves, it will emit a stream of speech tokens that can be converted directly into raw audio by a vocoder. Ultravox can be trained against any open-weight model; see below for training details.
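For reference, the current audio-in, text-out interface can be exercised through the Hugging Face `transformers` pipeline. The checkpoint name, input keys, and prompt below are assumptions based on publicly released Ultravox models; consult the model card for the exact interface.

```python
# Minimal inference sketch (assumed checkpoint name and input format; verify
# against the published model card before relying on this).
import librosa
import transformers

pipe = transformers.pipeline(
    model="fixie-ai/ultravox-v0_4",  # assumed model id on Hugging Face
    trust_remote_code=True,
)

# Load a spoken question at 16 kHz, the typical rate expected by speech encoders.
audio, sr = librosa.load("question.wav", sr=16000)

turns = [
    {"role": "system", "content": "You are a helpful voice assistant."},
]

# Audio goes in; text comes out (collected here as a single completion).
print(pipe({"audio": audio, "turns": turns, "sampling_rate": sr}, max_new_tokens=64))
```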