Wolfram Language & System Documentation Center

"GoogleSpeech" (Service Connection)

See Also
- ServiceExecute
- ServiceConnect
- SpeechRecognize
- SpeechSynthesize
- VoiceStyleData
- Service Connections
- OpenAI

Use Google Text-to-Speech and Speech-to-Text APIs with the Wolfram Language.

Connecting & Authenticating

ServiceConnect["GoogleSpeech"] creates a connection to the Google Speech-to-Text and Text-to-Speech APIs. If a previously saved connection can be found, it will be used; otherwise, a new authentication request will be launched.

Use of this connection requires internet access and a Google API account.

Requests

ServiceExecute["GoogleSpeech","request",params] sends a request to the Google Speech-to-Text or Text-to-Speech API, using parameters params. The following requests are available.

Synthesize Audio from Text

"ListVoices" — returns a list of available voice styles

Parameters:

Language	All	restrict the query to voices that can synthesize a given language
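A minimal sketch of a restricted "ListVoices" query. It assumes that Language accepts the same language Entity used with "Synthesize" in the examples on this page, and that each returned voice is an association with a "Name" key, as the "Voice" example suggests. The variable name spanishVoices is illustrative only.

```wolfram
(* Ask only for voices able to synthesize Spanish (assumes Language accepts a language Entity) *)
spanishVoices = ServiceExecute["GoogleSpeech", "ListVoices", {Language -> Entity["Language", "Spanish::77gfp"]}];

(* Extract the voice names, which can then be passed as the "Voice" parameter of "Synthesize" *)
Lookup[spanishVoices, "Name"]
```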

"Synthesize" — returns speech synthesized from text

Parameters:

"Input"	(required)	text to synthesize
"Voice"	Automatic	name of the synthesis voice
Language	Automatic	language of the synthesis voice
"Pitch"	Automatic	semitone deviation from the native voice pitch
"Rate"	Automatic	factor by which to change the native voice speed
AudioEncoding	Automatic	output audio encoding
GeneratedAssetLocation	$GeneratedAssetLocation	storage location of the synthesized audio
GeneratedAssetFormat	Automatic	output format of the synthesized audio
"EffectsProfileID"	Automatic	name of a post-processing effects profile applied to the speech

Recognize Text from Audio

"Recognize" — returns text transcribed from audio

Parameters:

"Input"	(required)	audio to transcribe
Language	"English"	language(s) of the speech in the audio
"ChannelRecognition"	False	whether to transcribe each audio channel separately
MaxItems	1	maximum number of transcription hypotheses to return
"ProfanityFilter"	False	whether to attempt to mask profanities in the transcript
"SpeechContexts"	{}	phrase hints to assist transcription
"WordTimeOffsets"	True	whether to include word time offsets in the result
"WordConfidence"	False	whether to include word confidence values in the result
"Punctuation"	True	whether to include punctuation in the transcription
"SpokenPunctuation"	False	whether to replace spoken punctuation with the corresponding ASCII characters
"SpokenEmojis"	False	whether to replace spoken emojis with the corresponding Unicode characters
"SpeakerDiarization"	False	whether to tag distinct speakers in the result
"Model"	Automatic	speech recognition model to use for the request
MetaInformation	None	metadata describing the input audio
MetaInformation	None	metadata describing the input audio

Parameter Details

Possible values for "Voice" can be retrieved using the "ListVoices" request.

Possible values for "Rate" are positive real numbers giving a speed factor (1 is the natural rate).

Possible values for "Pitch" are real numbers or quantities giving a deviation in semitones (0 is the natural pitch).

"SpeakerDiarization" accepts the number of speakers to detect, given as {max} or {min,max}.

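As a sketch of the {max} form, the following asks the service to distinguish at most three speakers in the built-in example recording. The limit of 3 is an arbitrary illustrative value; any speaker tags appear in the raw API response, as in the diarization example later on this page.

```wolfram
(* Example recording used elsewhere on this page *)
a = ExampleData[{"Audio", "MaleVoice"}, "Audio"];

(* {max} form: detect at most 3 distinct speakers; the value 3 is illustrative *)
res = ServiceExecute["GoogleSpeech", "Recognize", {"Input" -> a, Language -> "en-GB", "SpeakerDiarization" -> {3}}];

(* The raw API response is an association with a "results" key *)
Keys[res]
```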
Possible settings for "SpeechContexts" include:

	str->w	give weight w to the string str
	{str₁->w₁,str₂->w₂,…}	give weight wᵢ to the string strᵢ

Examples of possible settings for "EffectsProfileID" include:

	"large-automotive-class-device"	optimized for car speakers
	"small-bluetooth-speaker-class-device"	optimized for small home speakers

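For example, a sketch that tunes synthesized speech for playback on a small speaker, using one of the profile names listed above. Whether a given profile audibly changes the result depends on the voice and on the Google service.

```wolfram
(* Apply a post-processing profile intended for small home speakers *)
speech = ServiceExecute["GoogleSpeech", "Synthesize", {
    "Input" -> "Hello, world!",
    "EffectsProfileID" -> "small-bluetooth-speaker-class-device"
}]
```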
Examples of possible settings for "Model" include:

	"latest_long"	optimized for long-form content
	"latest_short"	optimized for short-form content
	"command_and_search"	optimized for short queries

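A sketch of selecting a recognition model explicitly. It assumes the example recording is short enough for "latest_short" to be appropriate and that this model is available for British English; which models are offered for which languages is determined by Google.

```wolfram
a = ExampleData[{"Audio", "MaleVoice"}, "Audio"];

(* Request the short-form model; assumes it is available for "en-GB" *)
res = ServiceExecute["GoogleSpeech", "Recognize", {"Input" -> a, Language -> "en-GB", "Model" -> "latest_short"}];

Lookup[res, "results"]
```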
Examples


Basic Examples (1)

Connect to the Google Speech service:

Wolfram Language code: google = ServiceConnect["GoogleSpeech"]

Perform text-to-speech:

Wolfram Language code: ServiceExecute[google, "Synthesize", {"Input" -> "Hello, world!"}]

Perform speech-to-text:

Wolfram Language code:

a = ExampleData[{"Audio", "MaleVoice"}, "Audio"];
ServiceExecute[google, "Recognize", {"Input" -> a, Language -> "en-GB"}]

Scope (5)

Speech Synthesis (3)

Synthesize audio from text:

Wolfram Language code: google = ServiceConnect["GoogleSpeech"]

Wolfram Language code: ServiceExecute[google, "Synthesize", {"Input" -> "Hello, world!"}]

Synthesize text in a different language. Setting Language to Automatic infers the language from the input text; alternatively, a specific language can be given. The service attempts to select a voice style for the requested language:

Wolfram Language code: ServiceExecute[google, "Synthesize", {"Input" -> "Hola mundo!", Language -> Automatic}]

Use an explicit language:

Wolfram Language code: ServiceExecute[google, "Synthesize", {"Input" -> "Hola mundo!", Language -> Entity["Language", "Spanish::77gfp"]}]

List available voice styles:

Wolfram Language code:

voices = ServiceExecute["GoogleSpeech", "ListVoices"];
voices//Dataset

Synthesize speech using a particular voice:

Wolfram Language code: vox = SelectFirst[voices, #Gender === "Female" && MemberQ[#Language, "Polish"]&]

Wolfram Language code: ServiceExecute["GoogleSpeech", "Synthesize", {"Input" -> "Witam świecie!", "Voice" -> vox["Name"]}]

Make the speech faster and lower in pitch:

Wolfram Language code: ServiceExecute["GoogleSpeech", "Synthesize", {"Input" -> "Hello, world!", "Pitch" -> -10, "Rate" -> 1.5}]

Speech Recognition (2)

Transcribe text from audio containing speech:

Wolfram Language code: a = ExampleData[{"Audio", "MaleVoice"}, "Audio"]

By default, everything from the API response is returned, including information about recognized words:

Wolfram Language code: ServiceExecute["GoogleSpeech", "Recognize", {"Input" -> a, Language -> "en-GB"}]

Return multiple candidate transcriptions:

Wolfram Language code:

ServiceExecute["GoogleSpeech", "Recognize", {"Input" -> a, Language -> "en-GB", MaxItems -> 4, "ReturnProperties" -> "transcript"}]

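Distinguish different speakers in a recording. First, create a single-channel recording of a male voice followed by the first 2.14 seconds of a female voice: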

Wolfram Language code:

in = AudioJoin[AudioChannelMix[#, 1]& /@ {ExampleData[{"Audio", "MaleVoice"}], AudioTrim[ExampleData[{"Audio", "FemaleVoice"}], 2.14]}]//Normal

Specify the minimum and maximum number of speakers:

Wolfram Language code:

res = ServiceExecute["GoogleSpeech", "Recognize", {"Input" -> in, "SpeakerDiarization" -> {2, 2}, "WordConfidence" -> False}]

Display the speaker-labeled words in a Dataset. The API currently returns the speaker labels in the second element of "results":

Wolfram Language code: Dataset[Lookup[Lookup[Lookup[res, "results"][[2]], "alternatives"][[1]], "words"]]
