Having built with and tried every voice model over the last three years, real time and non-real time... this is off the charts compared to anything I've seen before.
Doesn't seem to work for me - tried in both Firefox and Chromium and I can see the waveform when I talk but the transcription just shows "Awaiting audio input".
Thank you for the link! Their playground in Mistral does not have a microphone. it just uploads files, which does not demonstrate the speed and accuracy, but the link you shared does.
I tried speaking in 2 languages at once, and it picked it up correctly. Truly impressive for real-time.
Impressive indeed. Works way better than the speech recognition I first got demo'ed in... 1998? I remember you had to "click" on the mic everytime you wanted to speak and, well, not only the transcription was bad, it was so bad that it'd try to interpret the sound of the click as a word.
It was so bad I told several people not to invest in what was back then a national tech darling:
> I tried speaking in 2 languages at once, and it picked it up correctly.
I'm a native french speaker and I tried with a very simple sentence mixing french and english:
"Pour un pistolet je prefere un red dot mais pour une carabine je prefere un ACOG" (aka "For a pistol I prefer a red dot but for a carbine I prefer an ACOG")
And instead I got this:
"Je prépare un redote, mais pour une carabine, je préfère un ACOG."
"Je prépare un redote ..." doesn't mean anything and it's not at all what I said.
I like it, it's impressive, but literally the first sentence I tried it got the first half entirely wrong.
Wow, that’s weird. I tried Bengali, but the text transcribed into Hindi!I know there are some similar words in these languages, but I used pure Bengali that is not similar to Hindi.
Well, on the linked page, it mentions "strong transcription performance in 13 languages, including [...] Hindi" but with no mention of Bengali. It probably doesn't know a lick of Bengali, and is just trying to snap your words into the closest language it does know.
I’ve been using AquaVoice for real-time transcription for a while now, and it has become a core part of my workflow. It gets everything, jargon, capitalization, everything. Now I’m looking forward to doing that with 100% local inference!
Not terrible. It missed or mixed up a lot of words when I was speaking quickly (and not enunciating very well), but it does well with normal-paced speech.
Yeah it messed up a bit for me too when I didn't enunciate well. If I speak clearly it seems to work very well even with background noise. Remember Dragon Naturally Speaking? Imagine having this back then!
In English it is pretty good. But talk to it in Polish, and suddenly it thinks you speak Russian? Ukranian? Belarus? I would understand if an American company launched this, but for a company being so proud about their European roots, I think it should have better support for major European languages.
I tried English + Polish:
> All right, I'm not really sure if transcribing this makes a lot of sense. Maybe not. A цьому nie mówisz po polsku. A цьому nie mówisz po polsku, nie po ukrańsku.
They don't claim to support Polish, but they do support Russian.
> The model is natively multilingual, achieving strong transcription performance in 13 languages, including English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch. With a 4B parameter footprint, it runs efficiently on edge devices, ensuring privacy and security for sensitive deployments.
I wonder how much having languages with the same roots (e.g. the romance languages in the list above or multiple Slavic languages) affects the parameter count and the training set. Do you need more training data to differentiate between multiple similar languages? How would swapping, for example, Hindi (fairly distinct from the other 12 supported languages) for Ukrainian and Polish (both share some roots with Russian) affect the parameter count?
Swahili is subcontinental lingua franca spoken by 200M people and growing quickly. Polish is spoken by a shrinking population in one country where English is understood anyways.
> The model is natively multilingual, achieving strong transcription performance in 13 languages, including English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch.
Yeah, it's too bad. Apparently it only performs well in certain languages: "The model is natively multilingual, achieving strong transcription performance in 13 languages, including English, Chinese, Hindi, Spanish, Arabic, French, Portuguese, Russian, German, Japanese, Korean, Italian, and Dutch"
polish logically should be rendered in cyrillic as the cyrillic orthography more closely matches the sounds and consonant structure of slavic languages like polish and russian, although this has never been done for church reasons . maybe this is confusing ai
That's a mix of Polish and Ukrainian in the transcript. Now, if I try speaking Ukrainian, I'm getting transcript in Russian every time. That's upsetting.
Oh no! The model won't translate to an unsupported language, and incorrectly reverts to one that it was explicitly trained on.
The base likely was pretrained on days that included Polish and Ukrainian. You shouldn't be surprised to learn it doesn't perform great on languages it wasn't trained on, or perhaps had the highest share of training data.
I'm not sure why but their multilingual performance in general has usually been below average. For a French company, their models are not even close to being best in French, even outdone by the likes of Qwen. I don't think they're focusing on anything but English, the rest is just marketing.
Is it 0.003 per minute of audio uploaded, or "compute minute"?
For example fal.ai has a Whisper API endpoint priced at "$0.00125 per compute second" which (at 10-25x realtime) is EXTREMELY cheaper than all the competitors.
Incroyable! Competitive (if not better) than deepgram nova-3, and much better than assembly and elevenlabs in basically all cases on our internal streaming benchmarking.
The dataset is ~100 8kHz call recordings with gnarly UK accents (which I consider to be the final boss of english language ASR). It seems like it's SOTA.
Where it does fall down seems to be the latency distribution but I'm testing against the API. Running it locally will no doubt improve that?
I noticed that this model is multilingual and understands 14 languages. For many use cases, we probably only need a single language, and the extra 13 are simply adding extra latency. I believe there will be a trend in the coming years of trimming the fat off of these jack of all trades models.
Most English speakers likely would understand those and don’t speak French or Spanish. So it’s not necessary to tack on extra languages even if there are loan words.
STT services that have been around for longer, like Azure, Google and Amazon, generally require you to request a specific language, and their quality is a lot higher than models that advertise themselves as LLMs (even though I believe the clouds are also using the same types of models now).
It doesn't make sense to have a language-restricted transcription model because of code switching. People aren't machines, we don't stick to our native languages without failure. Even monolingual people move in and out of their native language when using "borrowed" words/phrases. A single-language model will often fail to deal with that.
Everything is a tradeoff, and different use cases require different tradeoffs:
Option A: this model
Option B: faster model, only 1 language
Option C: same size model, only 1 language but higher quality
My point is that option A isn’t always best.
And on the borrowed words bit, there’s no rule that we cannot add borrowed words into the vocab. But you don’t need the whole language. I know what deja voux means but I don’t speak French.
yeah, one example I run into is getting my perplexity phone assistant to play a song in spanish. I cannot for the life of me get a model to translate:
"Play señorita a mi me gusta su style on spotify" correctly
uhhh i cast doubt on multi-language support as affecting latency. model size, maybe, but what is the mechanism for making latency worse? i think of model latency as O(log(model size))… but i am open to being wrong / that being a not-good mental model / educated guess.
Even model size, it’s modest. There is a lot of machinery that is going to be common for all languages. You don’t multiply model size by 2 when you double the number of supported languages.
If encoding more learned languages and grammars and dictionaries makes the model size bigger, it will also increase latency. Try running a 1B model locally and then try to run a 500B model on the same hardware. You'll notice that latency has rather a lot to do with model size.
They've already done the inverse and trimmed non-coding abilities from their language model: https://openai.com/index/introducing-gpt-5-2-codex/. There's already precedent for creating domain-specific models.
I think it's nice to have specialized models for specific tasks that don't try to be generalists. Voxtral Transcript 2 is already extremely impressive, so imagine how much better it could be if it specialized in specific languages rather than cramming 14 languages into one model.
That said, generalist models definitely have their uses. I do want multilingual transcribing models to exist, I just also think that monolingual models could potentially achieve even better results for that specific language.
I'm so amazed to find out just how close we are to the start trek voice computer.
I used to use Dragon Dictation to draft my first novel, had to learn a 'language' to tell the rudimentary engine how to recognize my speech.
And then I discovered [1] and have been using it for some basic speech recognition, amazed at what a local model can do.
But it can't transcribe any text until I finish recording a file, and then it starts work, so very slow batches in terms of feedback latency cycles.
And now you've posted this cool solution which streams audio chunks to a model in infinite small pieces, amazing, just amazing.
Now if only I can figure out how to contribute to Handy or similar to do that Speech To Text in a streaming mode, STT locally will be a solved problem for me.
Happy to answer questions about this (or work with people on further optimizing the open source inference code here). NVIDIA has more inference tooling coming, but it's also fun to hack on the PyTorch/etc stuff they've released so far.
Yeah, I think the multilingual improvements in V3 caused some kind of regression for English - I've noticed large blocks occasionally dropped as well, so reverted to v2 for my usage. Specifically nvidia/parakeet-tdt-0.6b-v2 vs nvidia/parakeet-tdt-0.6b-v3
Parakeet is really good imo too, and it's just 0.6B so it can actually run on edge devices. 4B is massive, I don't see Voxtral running realtime on an Orin or fitting on a Hailo. An Orin Nano probably can't even load it at BF16.
Do you have experience with that model for diarization? Does it feel accurate, and what's its realtime factor on a typical GPU? Diarization has been the biggest thorn in my side for a long time..
You can test it yourself for free on https://console.mistral.ai/build/audio/speech-to-text
I tried it on an english-speaking podcast episode, and apart from identying one host as two different speakers (but only once for a few sentences at the start), the rest was flawless from what I could see
Played with the demo a bit. It's really good at English, and detects language change on the fly. Impressive.
But whatever I tried, it could not recognise my Ukrainian and would default to Russian in absolutely ridiculous transcription. Other STT models recognise Ukrainian consistently, so I assume there is a lot of Russian in training material, and zero Ukrainian. Made me really sad.
Thats just the result of the model only supporting russian (and 12 other languages) and not urkainian. It maps to the closest words from training data.
Is there an open source Android keyboard that would support it? Everything I find is based on Whisper, which is from 2022. Ages ago given how fast AI is evolving.
I wish I had a Google Keyboard that could easily run on Whisper Medium. This is already great. But unfortunately would be too much inference cost, incredibly slow. The problem with Whisper is not the inference quality: medium and large are incredible. Is that the base model is not enough, and the only one with fast inference in mobile devices.
WER is slightly misleading, but Whisper Large v3 WER is classically around 10%, I think, and 12% with Turbo.
The thing that makes it particularly misleading is that models that do transcription to lowercase and then use inverse text normalization to restore structure and grammar end up making a very different class of mistakes than Whisper, which goes directly to final form text including punctuation and quotes and tone.
But nonetheless, they're claiming such a lower error rate than Whisper that it's almost not in the same bucket.
On the topic of things being misleading, GPT-4o transcriber is a very _different_ transcriber to Whisper. I would say not better or worse, despite characterizations such. So it is a little difficult to compare on just the numbers.
There's a reason that quite a lot of good transcribers still use V2, not V3.
I really wish those offering speech-to-text models provided transcription benchmarks specific to particular fields of endeavor. I imagine performance would vary wildly when using jargon peculiar to software development, medicine, physics, and law, as compared to everyday speech. Considering that "enterprise" use is often specialized or sub-specialized, it seems like they're leaving money on Dragon's table by not catering to any of those needs.
Looks like this model doesn't do realtime diarization, what model should I use if I want that? So far I've only seen paid models do diarization well. I heard about Nvidia NeMo but haven't tried that or even where to try it out.
Italian represents, I believe, the most phonetically advanced human language. It has the right compromise among information density, understandability, and ability to speech much faster to compensate the redundancy. It's like if it had error correction built-in. Note that it's not just that it has the lower error rate, but is also underrepresented in most datasets.
I love seeing people from other countries share their own folk tales about what makes their countries special and unique. I've seen it up close in my country and I always cringed when I heard my fellow countrymen came up with these stories. In my adulthood I'm reassured that it happens everywhere and I find it endearing.
On the information density of languages: it is true that some languages have a more information dense textual representation. But all spoken languages convey about the same information in the same time. Which is not all that surprising, it just means that human brains have an optimal range at which they process information.
Further reading: Coupé, Christophe, et al. "Different Languages, Similar Encoding Efficiency: Comparable Information Rates across the Human Communicative Niche." Science Advances. https://doi.org/10.1126/sciadv.aaw2594
Different representations at the same bitrate may have features that make one a lot more resilient to errors. This thing about Italian, you fill find in any benchmark of vastly different AI transcribing models. You can find similar results also on the way LLMs mostly trained on English generalize usually very well with Italian. All this despite Italian accounting for marginal percentage of the training set. How do you explain that? I always cringe when people refute evidence.
This is largely due to the fact that modern Italian is a systematised language that emerged from a literary movement (whose most prominent representative is Alessandro Manzoni) to establish a uniform language for the Italian people. At the time of Italian unification in 1861, only about 2.5% of the population could speak this language.
The language itself was not invented for the purpose: it was the language spoken in Florence, than adopted by the literary movement and than selected as the national language.
It seems like the best tradeoff between information density and understandability actually comes from the deep latin roots of the language
I was honestly surprised to find it in the first place, because I assumed English to be at first place given the simpler grammar and the huge dataset available.
I agree with your belief, other languages have either lower density (e.g. German) or lower understandability (e.g. English)
English has a ton of homophones, way more sounds that differ slightly (long/short vowels), and major pronunciation differences across major "official" languages (think Australia/US/Canada/UK).
Italian has one official italian (two, if you count IT_ch, but difference is minor), doesn't pay much attention to stress and vowel length, and only has a few "confusable" sounds (gl/l, gn/n, double consonants, stuff you get wrong in primary school). Italian dialects would be a disaster tho :)
That's interesting. As a linguist, I have to say that Haskell is the most computationally advanced programming language, having the best balance of clear syntax and expressiveness. I am qualified to say this because I once used Haskell to make a web site, and I also tried C++ but I kept on getting errors.
/s obviously.
Tldr: computer scientists feel unjustifiably entitled to make scientific-sounding but meaningless pronouncements on topics outside their field of expertise.
At least some relatively well-known research finds that all languages have similar information density in terms of bits/second (~39 bits/second based on a quick search). Languages do it with different amounts of phonetic sound / syllables / words per bit and per second, but the bps comes out the same.
I don't know how widely accepted that conclusion is, what exceptions there may be, etc.
I did transcription for a while in 2021. It is absurdly hard. Especially as these days humans only get the difficult jobs that AI has already taken a stab at.
The hardest one I did was for a sports network where it was a motorcross motorbike event where most of what you could hear was the roar of the bikes. There were two commentators I had to transcribe over the top of that mess and they were using the slang insider nicknames for all the riders, not their published names, so I had to sit and Google forums to find the names of the riders while I was listening. I'm not even sure how these local models would even be able to handle that insanity at all because they almost certainly lack enough domain knowledge.
I was skepitcal upon hearing the figure but various sources do indeed back it up and [0] is a pretty interesting paper (old but still relevant human transcibers haven't changed in accuracy).
I think it's actually hard to verify how correct a transcription is, at scale. Curious where those error rate numbers come from, because they should test it on people actually doing their job.
You missed a giant factor: domain knowledge. Transcribing something outside of your knowledge realm is very hard. I posted above about transcribing the commentary of a motorbike race where the commentators only used the slang names of the riders.
I haven't quite figured out if the open weights they released on huggingface amount to being able to run the (realtime) model locally - i hope so though! For the larger model with diarization I don't think they open sourced anything.
> We've worked hand-in-hand with the vLLM team to have production-grade support for Voxtral Mini 4B Realtime 2602 with vLLM. Special thanks goes out to Joshua Deng, Yu Luo, Chen Zhang, Nick Hill, Nicolò Lucchesi, Roger Wang, and Cyrus Leung for the amazing work and help on building a production-ready audio streaming and realtime system in vLLM.
3 hours for a single request sounds nice to me. Although the graph suggests that it’s not going to perform as good as openai model I have been using, it is open source and surely I will give it a try.
This looks great, but it's not clear to me how to use it for a practical task. I need to transcribe about 10 years worth of monthly meetings. These are government hearings with a variety of speakers. All the videos are on YouTube. What's the most practical and cost-effective way to get reasonably accurate transcripts?
In the api it took 18s to do a 20m audio file I had lying around where someone is reviewing a product.
There will, I'm sure, be ways of running this locally up and available soon (if they aren't in huggingface right now) but the API is $0.003/min. If it's something like 120 meetings (10 years of monthly ones) then it's roughly $20 if the meetings are 1hr each. Depending on whether they're 1 or 10 hours (or if they're weekly or monthly but 10 parallel sessions or something) then this might be a price you're willing to pay if you get the results back in an afternoon.
edit - their realtime model can be run with vllm, the batch model is not open
As a rule of thumb for software that I use regularly, it is very useful to consider the costs over a 10-year period in order to compare it with software that I purchase for lifetime to install at home. So that means 1,798.80 $ for the Pro version.
Smells Like Teen Spirit survives another challenge!
Voxtral Transcribe 2:
Light up our guns, bring your friends, it's fun to lose and to pretend. She's all the more selfish, sure to know how the dirty world. I wasn't what I'd be best before this gift I think best A little girl is always been Always will until again Well, the lights out, it's a stage And we are now entertainers. I'm just stupid and contagious. And we are now entertainers. I'm a lot of, I'm a final. I'm a skater, I'm a freak. Yeah! Hey! Yeah. And I forget just why I taste it Yeah, I guess it makes me smile I found it hard, it's hard to find the well Whatever, never mind Well, the lights out, it's a stage. You and I are now entertainers. I'm just stupid and contagious. You and I are now entertainers. I'm a lot of, I'm a minor. I'm a killer. I'm a beater. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. And I forget just why I taste it Yeah, I guess it makes me smile I found it hard, it's hard to find the well Whatever, never mind I know, I know, I know, I know, I know Well, the lights out, it's a stage. You and I are now entertainers. I'm just stupid and contagious. You and I are now entertainers. I'm a lot of, I'm a minor. I'm a killer. I'm a beater. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd. I'm a nerd.
Google/Musixmatch:
Load up on guns, bring your friends
It's fun to lose and to pretend
She's over-bored, and self-assured
Oh no, I know a dirty word
Hello, hello, hello, how low?
Hello, hello, hello, how low?
Hello, hello, hello, how low?
Hello, hello, hello
With the lights out, it's less dangerous
Here we are now, entertain us
I feel stupid and contagious
Here we are now, entertain us
A mulatto, an albino
A mosquito, my libido, yeah
Hey, yey
I'm worse at what I do best
And for this gift, I feel blessed
Our little group has always been
And always will until the end
Hello, hello, hello, how low?
Hello, hello, hello, how low?
Hello, hello, hello, how low?
Hello, hello, hello
With the lights out, it's less dangerous
Here we are now, entertain us
I feel stupid and contagious
Here we are now, entertain us
A mulatto, an albino
A mosquito, my libido, yeah
Hey, yey
And I forget just why I taste
Oh yeah, I guess it makes me smile
I found it hard, it's hard to find
Oh well, whatever, never mind
Hello, hello, hello, how low?
Hello, hello, hello, how low?
Hello, hello, hello, how low?
Hello, hello, hello
With the lights out, it's less dangerous
Here we are now, entertain us
I feel stupid and contagious
Here we are now, entertain us
A mulatto, an albino
A mosquito, my libido
A denial, a denial
A denial, a denial
A denial, a denial
A denial, a denial
A denial
(when it was released, adults/press/etc. found SLTS famously incomprehensible and then they realized that the kids didn't understand the lyrics either, and Weird Al nailed it with his classic, Smells Like Nirvana: https://www.google.com/search?q=Smells+Like+Nirvana )
One week ago I was on the hunt for an open source model that can do diatization and I had to literally give up because I could not find any easy to use setup.
I don't know if that will change, but right now only the Voxtral Mini Transcribe V2 supports diarization and it's not open-weight.
The Voxtral Realtime model doesn't support diarization, but is open-weight.
does anyone know if there's any desktop tools I can use this transcription model with? e.g. something where like Wisper Flow/WillowVoice but with custom model selection
Disappointing how this lacks a clear reference implementation, if not mixed at almost yet unreleased VLLM (nightly version) stuff. I'm ok with Open Weights being a form of OSS in the case of models, because frankly I don't believe that, for large LLMs, it is feasible to release the training data, all the orchestration stuff, and so forth. But it can't be: here are the weights, we partnered with VLLM for inference. Come on. Open Weights must mean that you put me in a situation to write an implementation easily for any hardware.
p.s. even the demo uses a remote server via websocket.
Pseudo related -- am I the only one uncomfortable using my voice with AI for the concern that once it is in the training model it is forever reproducible? As a non-public person it seems like a risk vector (albeit small),
It's a real issue, but why do you only see it in ai? It's true for any case where you're speaking into a microphone
Depending on the permissions granted to apps on your mobile device, it can even be passively exfiltrated without you ever noticing - and that's ignoring the video clips people take and put online. Like your grandma uploading to Facebook a short moment from a Christmas meet or similar
There have already been successful scams - eg calls from "relatives" (AI) calling family members needing money urgently and convincing them to send the money...
Don't they have a partnership with the French Armed Forces? I am sure they are interested in automating Russian Audio or Text (-> Russian Text) -> French text.