Text to speech with Amazon Polly – AWS Application Services for AI/ML – MLS-C01 Study Guide

Text to speech with Amazon Polly

Amazon Polly converts text into speech using pretrained deep learning models. It is a fully managed service, so there are no models to train and no infrastructure to provision or maintain. You provide the input either as plain text or in Speech Synthesis Markup Language (SSML) format, and Polly returns an audio stream. It offers a range of languages and voices to choose from, with both male and female options. The output audio can be saved in MP3 format for further use in a web or mobile application, or you can request JSON output (speech marks) that describes the speech instead of audio.
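The request described above can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the helper names `build_request` and `synthesize` are hypothetical, the voice `Joanna` is only an example choice, and the client is passed in as a parameter so that the code does not depend on any particular AWS setup.

```python
from typing import Any, Dict


def build_request(text: str, voice_id: str = "Joanna",
                  output_format: str = "mp3") -> Dict[str, Any]:
    """Assemble keyword arguments for Polly's SynthesizeSpeech operation."""
    # Assumption: input that starts with <speak is SSML; anything else is plain text.
    text_type = "ssml" if text.lstrip().startswith("<speak") else "text"
    return {
        "Text": text,
        "TextType": text_type,
        "VoiceId": voice_id,            # example voice, chosen for illustration
        "OutputFormat": output_format,  # "json" also needs SpeechMarkTypes (not handled here)
    }


def synthesize(polly_client: Any, text: str, **options: Any) -> bytes:
    """Call synthesize_speech on the given client and return the audio bytes."""
    response = polly_client.synthesize_speech(**build_request(text, **options))
    # The response carries the audio as a streaming body under "AudioStream".
    return response["AudioStream"].read()
```

With boto3 installed and AWS credentials configured, `polly_client` would typically be `boto3.client("polly")`, and the returned bytes could be written to a file such as `speech.mp3`.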

For example, if you were to input the text “Baba went to the library” into Amazon Polly and request word-level speech marks, one of the returned speech mark objects would look as follows:

{"time":370,"type":"word","start":5,"end":9,"value":"went"}

The word “went” begins 370 milliseconds after the audio stream begins. In the input text, it starts at byte offset 5 and ends at byte offset 9, which is the position just after its last byte.
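To make the offsets concrete, here is a minimal sketch that parses speech mark output and uses the byte offsets to recover the word from the input text. It assumes the speech marks arrive as one JSON object per line; the function names `parse_speech_marks` and `slice_by_mark` are hypothetical helpers, not part of any AWS SDK.

```python
import json
from typing import Dict, List


def parse_speech_marks(payload: str) -> List[Dict]:
    """Parse speech marks, assuming one JSON object per non-empty line."""
    return [json.loads(line) for line in payload.splitlines() if line.strip()]


def slice_by_mark(text: str, mark: Dict) -> str:
    """Recover the marked span using byte (not character) offsets."""
    raw = text.encode("utf-8")
    return raw[mark["start"]:mark["end"]].decode("utf-8")


text = "Baba went to the library"
payload = '{"time":370,"type":"word","start":5,"end":9,"value":"went"}\n'

for mark in parse_speech_marks(payload):
    # prints: 370 went
    print(mark["time"], slice_by_mark(text, mark))
```

Slicing the encoded bytes rather than the string matters once the text contains non-ASCII characters, because the offsets count bytes, not characters.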

Besides MP3, Amazon Polly can also return audio in ogg_vorbis and pcm formats. When pcm is used, the content that is returned is audio/pcm in a signed 16-bit, 1-channel (mono), little-endian format.
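Raw PCM has no header, so most players cannot open it directly. The following sketch wraps such a stream in a WAV container using Python's standard `wave` module. The helper name `pcm_to_wav` is hypothetical, and the default sample rate of 16,000 Hz is an assumption; pass whatever rate was actually requested from Polly.

```python
import io
import wave


def pcm_to_wav(pcm_bytes: bytes, sample_rate: int = 16000) -> bytes:
    """Wrap raw signed 16-bit, mono, little-endian PCM in a WAV container."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)            # 1 channel (mono)
        wav_file.setsampwidth(2)            # 16 bits = 2 bytes per sample
        wav_file.setframerate(sample_rate)  # assumed rate; must match the request
        wav_file.writeframes(pcm_bytes)
    return buffer.getvalue()
```

The sample data is copied unchanged, since WAV stores 16-bit audio as signed little-endian values as well; only the header is added.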

Some common uses of Amazon Polly include the following:

  • It can be used as an accessibility tool for reading web content aloud.
  • It can be integrated with Amazon Rekognition to help visually impaired people read signs. You take a picture of the sign and feed it to Amazon Rekognition to extract the text. That text is then used as input for Polly, which returns it as speech.
  • It can be used in a public address system, where the admin team passes in the text to be announced and Amazon Polly generates the announcement audio.
  • By combining Amazon Polly with Amazon Connect (a cloud contact center service), you can build an interactive voice response (IVR) system.
  • Smart devices such as smart TVs, smart watches, and Internet of Things (IoT) devices can use it for audio output.
  • It can be used for narration generation.
  • When combined with Amazon Lex, it can be used to develop full-blown voice user interfaces for applications.
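As an illustration of the sign-reading use case in the list above, the following sketch chains text detection and speech synthesis. It is a simplified sketch under stated assumptions: both clients are passed in as parameters (with boto3 they would typically be created with `boto3.client("rekognition")` and `boto3.client("polly")`), the helper names are hypothetical, and `Joanna` is only an example voice.

```python
from typing import Any, List


def extract_lines(rekognition_response: dict) -> List[str]:
    """Keep LINE-level detections so individual words are not repeated."""
    return [
        item["DetectedText"]
        for item in rekognition_response.get("TextDetections", [])
        if item.get("Type") == "LINE"
    ]


def read_sign_aloud(rekognition: Any, polly: Any, image_bytes: bytes,
                    voice_id: str = "Joanna") -> bytes:
    """Detect the text in an image and return it as MP3 speech."""
    detected = rekognition.detect_text(Image={"Bytes": image_bytes})
    sentence = ". ".join(extract_lines(detected))
    if not sentence:
        return b""  # nothing legible was found on the sign
    speech = polly.synthesize_speech(
        Text=sentence, OutputFormat="mp3", VoiceId=voice_id
    )
    return speech["AudioStream"].read()
```

Filtering on `LINE` is a design choice for this sketch: the text detection response reports both whole lines and the individual words within them, so keeping both would make the sign be read twice.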

Now, let’s explore the benefits of Amazon Polly.

Exploring the benefits of Amazon Polly

Some of the benefits of using Amazon Polly include the following:

  • The service is fully managed, so there are no resources for you to administer or maintain.
  • It provides an instant speech correction and enhancement facility.
  • You can develop your own access layer using the HTTP API from Amazon Polly. Development is easy because AWS SDKs are available for many programming languages, such as Python, Ruby, Go, C++, Java, and Node.js.
  • For certain neural voices, speech can be synthesized using the Newscaster style, to make them sound like a TV or radio broadcaster.
  • Amazon Polly also allows you to customize the pronunciation of particular words, or define how new words should be spoken, by using custom lexicons or SSML tags.
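The last two benefits can be sketched with plain string builders. The first helper wraps text in the SSML tag that requests the Newscaster style, and the second builds a small pronunciation lexicon that makes one term be read aloud as another. Both function names are hypothetical, and the lexicon is a minimal example rather than a complete template.

```python
from xml.sax.saxutils import escape


def newscaster_ssml(text: str) -> str:
    """Wrap text in the SSML tag that requests the Newscaster style."""
    return ('<speak><amazon:domain name="news">'
            f"{escape(text)}</amazon:domain></speak>")


def alias_lexicon(grapheme: str, alias: str, lang: str = "en-US") -> str:
    """Build a minimal PLS lexicon that reads `grapheme` aloud as `alias`."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<lexicon version="1.0"\n'
        '    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"\n'
        f'    alphabet="ipa" xml:lang="{lang}">\n'
        "  <lexeme>\n"
        f"    <grapheme>{escape(grapheme)}</grapheme>\n"
        f"    <alias>{escape(alias)}</alias>\n"
        "  </lexeme>\n"
        "</lexicon>\n"
    )


# prints: <speak><amazon:domain name="news">Markets rose &amp; closed higher.</amazon:domain></speak>
print(newscaster_ssml("Markets rose & closed higher."))
```

With boto3, the SSML string would be sent with `TextType="ssml"` and `Engine="neural"` using a voice that supports the style. The lexicon would be uploaded with `put_lexicon(Name=..., Content=...)` and then referenced through the `LexiconNames` parameter of `synthesize_speech`.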

Next, you’ll get hands-on with Amazon Polly.