SpeechRecognize
SpeechRecognize[audio]
recognizes speech in audio and returns it as a string.
SpeechRecognize[audio,level]
returns a list of strings at the specified structural level.
SpeechRecognize[audio,level,prop]
returns prop for text at the given level.
Details and Options
- Speech recognition aims to convert a spoken audio signal to text. It is also known as speech-to-text and is typically used in voice-enabled human-machine interactions and digital personal assistants.
- SpeechRecognize[audio] returns all recognized speech in audio as a single string.
- Structural elements specified in level include:
    Automatic     speech found in the whole audio signal (default)
    "Segment"     a list of transcription segments
    "Sentence"    a list of sentences
    "Word"        a list of words
- The property prop can be one of the following:
    "Audio"            trimmed audio containing the recognized text
    "Confidence"       strength of the recognized text
    "Interval"         interval containing the text
    "SubtitleRules"    a list of time intervals and texts
    "Text"             recognized text (default)
    {prop1,prop2,…}    a list of properties
- The following options can be given:
    Language             Automatic             the language to recognize
    Masking              All                   interval of interest
    Method               Automatic             the method to use
    PerformanceGoal      $PerformanceGoal      aspects of performance to try to optimize
    ProgressReporting    $ProgressReporting    whether to report the progress of the computation
    TargetDevice         "CPU"                 the device on which to perform recognition
- Use Language->lang1->lang2 to recognize speech assumed to be in language lang1 and return translated text in language lang2.
- By default, speech in the whole signal is recognized. Use Masking->{int1,int2,…} to limit the recognition to intervals inti.
- Possible settings for Method are:
    Automatic          automatic method
    "GoogleSpeech"     uses Google speech-to-text
    "NeuralNetwork"    uses built-in neural networks
    "OpenAI"           uses OpenAI speech-to-text
- By default, if a method returns non-speech tokens (e.g. [applause]), they are returned in the result. Use Method->{method,"NonSpeechReplacement"->replacements} to specify different replacements. Use "NonSpeechReplacement"->"" to remove them.
- SpeechRecognize works for English speech as well as various other languages, such as Chinese, Dutch, French, Japanese and Spanish.
- SpeechRecognize uses machine learning. Its methods, training sets and biases included therein may change and yield varied results in different versions of the Wolfram Language.
- SpeechRecognize may download resources that will be stored in your local object store at $LocalBase, and can be listed using LocalObjects[] and removed using ResourceRemove.
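The settings described in the notes above can be combined as in the following minimal sketch. It assumes the "MaleVoice" example audio used elsewhere on this page; the two masking intervals are arbitrary illustrative values in seconds, and "NeuralNetwork" is chosen only as one of the listed methods.

```wolfram
(* Example audio that contains speech *)
a = ExampleData[{"Audio", "MaleVoice"}];

(* Restrict recognition to two intervals; the bounds are illustrative *)
SpeechRecognize[a, Masking -> {{0.5, 1}, {1.5, 2}}]

(* Remove non-speech tokens such as [applause] from the result *)
SpeechRecognize[a, Method -> {"NeuralNetwork", "NonSpeechReplacement" -> ""}]

(* Ask for the "SubtitleRules" property at the "Segment" level *)
SpeechRecognize[a, "Segment", "SubtitleRules"]
```

Results depend on the installed version and on the method that is available, so no output is shown here.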
Examples
Basic Examples (2)
Scope (4)
Basic Uses (2)
Recognize speech in a short audio track:
SpeechRecognize[ExampleData[{"Audio", "Apollo11ReturnSafely"}]]
SpeechRecognize[ExampleData[{"Audio", "Apollo11SmallStep"}]]
Recognize speech in an audio track of a video file:
SpeechRecognize[Audio@Video["ExampleData/bullfinch.mkv", AudioTrackSelection -> 2]]
Recognize speech in a non-English language:
SpeechRecognize[Audio@Video["ExampleData/bullfinch.mkv", AudioTrackSelection -> 3]]
Classify the language from the recognized text:
Classify["Language"][%]
Classify the language from the original audio:
Classify["SpokenLanguage"][Audio@Video["ExampleData/bullfinch.mkv", AudioTrackSelection -> 3]]
Level Specification (1)
By default, all recognized text is returned as one string:
a = Audio[Video["ExampleData/bullfinch.mkv", AudioTrackSelection -> 2]]
SpeechRecognize[a]
Extract a list of recognized sentences:
SpeechRecognize[a, "Sentence"] // Column
SpeechRecognize[a, "Word"][[;; 10]] // Column
Extract a list of segments, typically used for splitting text for subtitles:
SpeechRecognize[a, "Segment"] // Column
Properties (1)
By default, recognized speech is returned as a string or as lists of strings:
SpeechRecognize[audio, "Word"]
Return the text of each word with its time interval, the corresponding chunk of audio and the recognition strength:
SpeechRecognize[audio, "Word", {"Text", "Interval", "Audio", "Confidence"}] // TextGrid[#, Frame -> All] &
Options (3)
Masking (1)
Use the Masking option to recognize parts of a signal:
a = ExampleData[{"Audio", "MaleVoice"}];
SpeechRecognize[a, Masking -> {1, 2}]
Method (1)
By default, a local model is used for speech recognition:
SpeechRecognize[audio]
Use OpenAI speech recognition:
SpeechRecognize[audio, Method -> "OpenAI"]
Use Google speech recognition:
SpeechRecognize[audio, Method -> "GoogleSpeech"]
PerformanceGoal (1)
By default, a medium-speed model with moderate quality is used:
a = audio;
SpeechRecognize[a]
SpeechRecognize[a, PerformanceGoal -> "Speed"] // AbsoluteTiming
Get a higher-quality result:
SpeechRecognize[a, PerformanceGoal -> "Quality"] // AbsoluteTiming
Get a result that balances speed and quality:
SpeechRecognize[a, PerformanceGoal -> "Balanced"] // AbsoluteTiming
Applications (4)
Use AudioIntervals to select which parts of the signal to recognize:
a = ExampleData[{"Audio", "MaleVoice"}];
intervals = AudioIntervals[a, "Audible"]
SpeechRecognize[a, Masking -> intervals]

Interpreter["City"][SpeechRecognize[SpeechSynthesize["Chicago"]]]
Show the recognized city on the map:
GeoGraphics[GeoMarker[%]]
Find the answer from a spoken question in a text:
FindTextualAnswerFromSpeech[text_] := FindTextualAnswer[text, Echo[SpeechRecognize[AudioCapture[]]]]
FindTextualAnswerFromSpeech["Paris is the capital and most populous city of France, with a 2015 population of 2,229,621."]
Build an automatic assistant based on Wolfram|Alpha:
DynamicModule[{s = AudioStream[$DefaultAudioInputDevice], result = "___", string = "___"},
  Dynamic[Column[{
    Button[
      If[s["Status"] === "Stopped", "Record", "Stop"],
      If[s["Status"] === "Stopped",
        AudioRecord[s];,
        AudioStop[s];
        string = SpeechRecognize[Normal@Audio[s]];
        result = Normal[SpeechSynthesize[WolframAlpha[string, "SpokenResult"]]];
        AudioPlay[result];
      ]
    ],
    string, result
  }]]
]
Wolfram Research (2019), SpeechRecognize, Wolfram Language function, https://reference.wolfram.com/language/ref/SpeechRecognize.html (updated 2024).