
The development of speech synthesis chip technology

News by ninechip | On 2018-07-17 08:40

The development of speech synthesis chip technology can be divided into three stages.

The first generation of embedded speech synthesis engines (2000): monosyllabic waveform splicing technology

Monosyllabic waveform splicing is the first generation of speech synthesis technology. Put simply, each of the more than 1,400 syllables used in Chinese pronunciation is recorded and stored in a sound library. To synthesize text, the system looks up the syllable that matches the Pinyin of each Chinese character and joins the recordings in order, which gives the simplest possible speech synthesis system. The results, however, are unsatisfactory: the output sounds mechanical, one word at a time, and sentences lack continuity, so the technique cannot be used on a large scale.
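The lookup-and-splice idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SOUND_LIBRARY` contents, the Pinyin keys, and the `synthesize` function are hypothetical, and short integer lists stand in for recorded audio samples.

```python
# Hypothetical miniature sound library: exactly one recording per Pinyin syllable.
# A real library would hold 1,400+ syllables of recorded audio; short integer
# lists stand in for the waveform samples here.
SOUND_LIBRARY = {
    "ni3": [1, 2, 3],
    "hao3": [4, 5],
    "ma5": [6],
}


def synthesize(pinyin_sequence, library=SOUND_LIBRARY):
    """Join the single stored waveform of each syllable, in order."""
    samples = []
    for syllable in pinyin_sequence:
        if syllable not in library:
            raise KeyError(f"no recording for syllable {syllable!r}")
        samples.extend(library[syllable])
    return samples


waveform = synthesize(["ni3", "hao3"])
```

Because every syllable has only one recording, the same samples are reused regardless of the surrounding words, which is why the output sounds like isolated words placed end to end.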

The second generation of embedded speech synthesis engines (2004): large corpus reduction technology

To improve on first-generation speech synthesis, developers turned to large-corpus synthesis. The first generation did not take into account that a syllable is pronounced differently in different sentence contexts. Each syllable had only one candidate unit, so the splicing was rigid and the resulting sentences sounded incoherent.

Large-corpus synthesis addresses the fact that Chinese characters are pronounced differently in different contexts by expanding the corpus: the sound library keeps as many candidate units as possible for each Chinese syllable, recorded in different contexts. During synthesis, the candidate unit best suited to the current context is selected and spliced. The larger the library and the more pronunciation variants it contains, the closer the output comes to natural human speech. At present, professional speech synthesis systems, such as iFlytek's telecom-grade and service-grade systems, use large or even very large corpora. Each sound library can reach several gigabytes. In theory, such a system can approach the quality of a real human speaker.
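A minimal sketch of context-dependent candidate selection follows. It rests on simplifying assumptions: each candidate is tagged only with the syllables that preceded and followed it in the recording, and the cost is a plain count of mismatches. The `Candidate` class, the `CORPUS` contents, and the function names are hypothetical. Production systems use much richer context features and also score how smoothly adjacent units join.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    prev_syllable: str  # syllable before this unit in the recording ("" at sentence start)
    next_syllable: str  # syllable after this unit in the recording ("" at sentence end)
    samples: tuple      # stand-in for the recorded waveform


# Hypothetical corpus: several recordings of the same syllable in different contexts.
CORPUS = {
    "ni3": [Candidate("", "hao3", (1, 2))],
    "hao3": [
        Candidate("ni3", "", (10, 11)),
        Candidate("", "de5", (20, 21)),
    ],
}


def context_cost(candidate, prev_syllable, next_syllable):
    """Count how many neighbouring syllables differ from the target context."""
    return (candidate.prev_syllable != prev_syllable) + (
        candidate.next_syllable != next_syllable
    )


def select_units(pinyin_sequence, corpus=CORPUS):
    """Pick, for each syllable, the candidate whose recorded context fits best."""
    chosen = []
    for i, syllable in enumerate(pinyin_sequence):
        prev_s = pinyin_sequence[i - 1] if i > 0 else ""
        next_s = pinyin_sequence[i + 1] if i + 1 < len(pinyin_sequence) else ""
        best = min(corpus[syllable], key=lambda c: context_cost(c, prev_s, next_s))
        chosen.append(best)
    return chosen


def synthesize_with_context(pinyin_sequence, corpus=CORPUS):
    """Splice the selected candidates into one waveform."""
    samples = []
    for unit in select_units(pinyin_sequence, corpus):
        samples.extend(unit.samples)
    return samples
```

With more candidates per syllable, the chance of finding one recorded in a matching context rises, which is the reason quality improves as the library grows.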

In an embedded environment, however, such a large library cannot be accommodated. This generation of embedded speech technology uses a variety of statistical decision algorithms to select the most representative syllable candidates from a large corpus system and discard the rest. This reduces the size of the system while still preserving good synthesis quality to a certain extent.
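The article does not say which statistical algorithms are used. As one possible illustration, the sketch below keeps only the candidates that a full-size system selected most often over some training text. Selection frequency as the criterion, the `prune_library` function, and the `usage_log` format are all assumptions made for this example.

```python
from collections import Counter


def prune_library(usage_log, max_per_syllable):
    """Keep the most frequently selected candidates of each syllable.

    usage_log is an iterable of (syllable, candidate_id) pairs, one pair for
    every time the full-size system picked that candidate (assumed format).
    """
    counts = {}
    for syllable, candidate_id in usage_log:
        counts.setdefault(syllable, Counter())[candidate_id] += 1
    return {
        syllable: [cid for cid, _ in counter.most_common(max_per_syllable)]
        for syllable, counter in counts.items()
    }


# Hypothetical log: candidate "a" of "hao3" was chosen three times, "b" twice, "c" once.
usage_log = [
    ("hao3", "a"), ("hao3", "b"), ("hao3", "a"),
    ("hao3", "c"), ("hao3", "b"), ("hao3", "a"),
    ("ni3", "x"),
]
kept = prune_library(usage_log, max_per_syllable=2)
```

Lowering `max_per_syllable` shrinks the library but leaves fewer contexts covered, which is the size-versus-quality trade-off this generation has to balance.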

The disadvantage of the large-corpus approach is that continuing to improve synthesis quality requires adding more candidate units per syllable to the corpus, so the size of the system keeps growing.

The third generation of embedded speech synthesis engines (2005): size reduction, effect enhancement

To further improve synthesis quality without being constrained by system size, iFlytek has been researching third-generation speech synthesis technology. This generation improves greatly on the second. First, naturalness is higher, so the synthesized speech sounds better and is more practical. Second, the synthesis is more adjustable, for example in speaking rate and intonation. Third, the system is smaller and uses fewer processor resources, making it better suited to embedded environments.

The third generation has greatly improved the quality of embedded speech synthesis, creating an opportunity for large-scale commercial application of embedded speech technology and pointing to a bright future for speech synthesis. Its results can be heard in audio e-book products. It also represents the world's highest level of embedded Chinese speech synthesis technology.

Responsible editor: Nine Core Electronics
