

Siri is a personal assistant that communicates using speech synthesis. Starting in iOS 10 and continuing with new features in iOS 11, we base Siri voices on deep learning. The resulting voices are more natural, smoother, and allow Siri's personality to shine through. This article presents more details about the deep learning based technology behind Siri's voice.

Speech synthesis, the artificial production of human speech, is widely used for various applications from assistive technology to gaming and entertainment. Recently, combined with speech recognition, speech synthesis has become an integral part of virtual personal assistants, such as Siri.

There are essentially two speech synthesis techniques used in the industry: unit selection and parametric synthesis. Unit selection synthesis provides the highest quality given a sufficient amount of high-quality speech recordings, and thus it is the most widely used speech synthesis technique in commercial products. On the other hand, parametric synthesis provides highly intelligible and fluent speech, but suffers from lower overall quality. Therefore, parametric synthesis is often used when the corpus is small or a low footprint is required. Modern unit selection systems combine some of the benefits of the two approaches, and so are referred to as hybrid systems. Hybrid unit selection methods are similar to classical unit selection techniques, but they use the parametric approach to predict which units should be selected.

Recently, deep learning has gained momentum in the field of speech technology, largely surpassing conventional techniques, such as hidden Markov models (HMMs). Parametric synthesis has benefited greatly from deep learning technology. Deep learning has also enabled a completely new approach for speech synthesis called direct waveform modeling (for example using WaveNet), which has the potential to provide both the high quality of unit selection synthesis and the flexibility of parametric synthesis. However, given its extremely high computational cost, it is not yet feasible for a production system. In order to provide the best possible quality for Siri's voices across all platforms, Apple is now taking a step forward to utilize deep learning in an on-device hybrid unit selection system.

How Does Speech Synthesis Work?

Building a high-quality text-to-speech (TTS) system for a personal assistant is not an easy task. The first phase is to find a professional voice talent whose voice is both pleasant and intelligible and fits the personality of Siri. In order to cover some of the vast variety of human speech, we first need to record 10-20 hours of speech in a professional studio.
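To make the hybrid unit selection idea concrete, here is a minimal, illustrative sketch (not Apple's implementation) of the search at the heart of any unit selection system: for each target sound we have several candidate units from the recorded database, and we pick the sequence that minimizes the sum of a target cost (how well a unit matches the prediction) and a concatenation cost (how smoothly adjacent units join), using dynamic programming. In a hybrid system, the target specifications would come from a parametric (e.g. deep learning) model; here targets, units, and both cost functions are simplified toy stand-ins.

```python
def select_units(targets, candidates, target_cost, concat_cost):
    """Pick one unit per target minimizing total target + concatenation cost.

    targets: list of target specifications (here, desired pitch values).
    candidates: per-target lists of candidate units from the database.
    Returns the lowest-total-cost sequence of units (Viterbi-style search).
    """
    n = len(targets)
    # best[i][j] = (cumulative cost, index of best predecessor) for
    # candidate j of target i
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, n):
        row = []
        for u in candidates[i]:
            tc = target_cost(targets[i], u)
            # choose the cheapest predecessor, adding the join cost
            cost, back = min(
                (best[i - 1][k][0] + concat_cost(prev, u) + tc, k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost, back))
        best.append(row)
    # backtrack from the cheapest final candidate
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(n - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))

# Toy example: targets are desired pitch values; units are (name, pitch)
# pairs, so the search trades off pitch accuracy against smooth joins.
targets = [100, 120, 110]
candidates = [
    [("a1", 95), ("a2", 130)],
    [("b1", 118), ("b2", 90)],
    [("c1", 112), ("c2", 150)],
]
tcost = lambda t, u: abs(t - u[1])
ccost = lambda prev, u: abs(prev[1] - u[1]) * 0.1
print(select_units(targets, candidates, tcost, ccost))
# → [('a1', 95), ('b1', 118), ('c1', 112)]
```

Real systems search over much richer units (half-phones with spectral, pitch, and duration features) and databases with millions of candidates, which is why efficient pruning and, in the hybrid case, good parametric predictions of the target costs matter so much.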
