Text to Speech

Convert text to speech

POST

Convert text to speech using a specified voice. Supports emotion control, speed adjustment, and multiple output formats.

Authorizations
Authorization (string, required)

Body
text (string, required)

Text to convert (max 5000 chars)

voice_id (string, optional)

Voice ID to use for synthesis

file (string · binary, optional)

Audio file for voice cloning (alternative to voice_id)

quality_preset (integer, optional)

Quality preset for synthesis

Default: 3

output_format (string · enum, optional)

Output audio format (wav/mp3)

Default: wav
Possible values: wav, mp3

speed (number, optional)

Speech speed multiplier

Default: 1

duration (number, optional)

Target audio duration in seconds

Default: 0

target_lang (string, optional)

Target language code (e.g., "zh", "en", "zh+en")

similarity_enh (boolean, optional)

Whether to enhance voice similarity

Default: false

emo (string, optional)

Emotion parameters as JSON string, e.g., {"Sadness":0.2, "Surprise":0.5}

trim_silence (boolean, optional)

Whether to trim leading and trailing silence from the generated audio

Default: false

save_voice (boolean, optional)

Whether to save the uploaded voice file

Default: false
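
The body parameters above can be assembled as in the following minimal sketch. It only covers a subset of the fields, and the helper name `build_tts_fields` is hypothetical. It assumes that booleans are sent as lowercase strings and that `emo` is serialized with `json.dumps`, which matches the documented "JSON string" description but is not confirmed beyond that.

```python
import json

MAX_TEXT_CHARS = 5000  # documented limit for `text`
ALLOWED_FORMATS = ("wav", "mp3")  # documented output formats


def build_tts_fields(text, voice_id=None, emo=None, speed=1.0,
                     output_format="wav", trim_silence=False):
    """Return the request fields as a dict of strings (hypothetical helper)."""
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError("text exceeds the documented 5000-character limit")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError("output_format must be 'wav' or 'mp3'")

    fields = {
        "text": text,
        "speed": str(speed),
        "output_format": output_format,
        # Assumption: booleans are accepted as lowercase strings.
        "trim_silence": "true" if trim_silence else "false",
    }
    if voice_id is not None:
        fields["voice_id"] = voice_id
    if emo is not None:
        # `emo` is documented as a JSON string, so the dict is serialized here.
        fields["emo"] = json.dumps(emo)
    return fields
```

For example, `build_tts_fields("Hello", voice_id="YOUR_VOICE_ID", emo={"Sadness": 0.2, "Surprise": 0.5})` produces a dict whose `emo` entry is a JSON string rather than a nested object.
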
Responses
200

Success: returns the audio as a binary stream

Response: string · binary

POST /text-to-speech
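
A call to this endpoint might look like the sketch below, which uses only the Python standard library. Several details are assumptions rather than documented facts: the host in `BASE_URL` is a placeholder, the body is sent as `multipart/form-data` (suggested by the binary `file` field, but not stated), and the `Authorization` header is passed as a raw key with no prefix. The function names are hypothetical.

```python
import uuid
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host; replace with the real one


def encode_multipart(fields, boundary):
    """Encode plain text fields as a multipart/form-data body."""
    lines = []
    for name, value in fields.items():
        lines.append(f"--{boundary}")
        lines.append(f'Content-Disposition: form-data; name="{name}"')
        lines.append("")
        lines.append(str(value))
    lines.append(f"--{boundary}--")
    lines.append("")
    return "\r\n".join(lines).encode("utf-8")


def build_tts_request(api_key, fields, base_url=BASE_URL):
    """Build (but do not send) the POST /text-to-speech request."""
    boundary = uuid.uuid4().hex
    return urllib.request.Request(
        f"{base_url}/text-to-speech",
        data=encode_multipart(fields, boundary),
        method="POST",
        headers={
            # Assumption: the key is sent as-is; the API may expect a prefix.
            "Authorization": api_key,
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def synthesize_to_file(api_key, fields, out_path):
    """Send the request and write the returned audio bytes to out_path."""
    request = build_tts_request(api_key, fields)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
```

A typical use would be `synthesize_to_file("YOUR_API_KEY", {"text": "Hello", "voice_id": "YOUR_VOICE_ID", "output_format": "mp3"}, "hello.mp3")`. Uploading a `file` for voice cloning would need an extra multipart part with a filename and binary content, which this sketch leaves out.
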

Emotion enhancement

POST

Automatically add emotion annotations to input text. The enhanced text can be used directly in the text-to-speech API for more expressive speech synthesis.

Authorizations
Authorization (string, required)

Body
text (string, required)

Text to enhance with emotions (max 5000 chars)

Responses
200

Success

application/json

code (integer, optional)

Status code (0 = success)

message (string, optional)

POST /emotion-enhance
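
The sketch below shows one way to call this endpoint and check the documented `code` field. It assumes a JSON request body, a placeholder host in `BASE_URL`, and a raw key in the `Authorization` header; none of these are confirmed by the reference above. Since only `code` and `message` are documented in the response, the parsed payload is returned whole rather than guessing which field carries the enhanced text. All function names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder host; replace with the real one
MAX_TEXT_CHARS = 5000  # documented limit for `text`


def build_enhance_request(api_key, text, base_url=BASE_URL):
    """Build (but do not send) the POST /emotion-enhance request."""
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError("text exceeds the documented 5000-character limit")
    return urllib.request.Request(
        f"{base_url}/emotion-enhance",
        # Assumption: the endpoint accepts a JSON body.
        data=json.dumps({"text": text}).encode("utf-8"),
        method="POST",
        headers={"Authorization": api_key, "Content-Type": "application/json"},
    )


def parse_enhance_response(raw):
    """Parse the JSON response and raise unless code == 0 (documented success)."""
    payload = json.loads(raw)
    if payload.get("code") != 0:
        raise RuntimeError(
            f"emotion-enhance failed: code={payload.get('code')} "
            f"message={payload.get('message')}"
        )
    return payload


def enhance(api_key, text):
    """Send the request and return the parsed response payload."""
    with urllib.request.urlopen(build_enhance_request(api_key, text)) as response:
        return parse_enhance_response(response.read())
```
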