LEARNING DEEP AND WIDE CONTEXTUAL REPRESENTATIONS USING BERT FOR STATISTICAL PARAMETRIC SPEECH SYNTHESIS

Abstract: In this paper, we propose a method of learning deep and wide contextual representations for statistical parametric speech synthesis (SPSS) using BERT, a pre-trained language representation model. Traditional acoustic models in SPSS take phoneme sequences and prosody labels as input, and cannot make full use of the deep linguistic representations of the current and surrounding sentences. Therefore, this paper designs two context encoders, i.e., a sentence-window context encoder and a paragraph-level context encoder, to integrate the contextual representations extracted from multiple sentences by BERT into Tacotron via an extra attention module. The parameters of BERT are pre-trained and then fine-tuned together with the other components of the model. Experimental results on the Blizzard Challenge 2019 dataset show that both context encoders reduce the errors of acoustic feature prediction and improve the subjective quality of the synthetic speech compared with the baseline Tacotron model.
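The extra attention module described above can be illustrated with a minimal NumPy sketch: a query state attends over BERT-style representations of the surrounding sentences and produces a weighted context vector. All function names, dimensions, and data here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def context_attention(query, context_reps):
    """Attend over contextual sentence representations.

    query:        (d,) query state at one decoder step
    context_reps: (n_sent, d) sentence-level representations
                  (e.g. extracted by BERT from surrounding sentences)
    Returns the attention-weighted context vector and the weights.
    """
    d = query.shape[-1]
    scores = context_reps @ query / np.sqrt(d)   # (n_sent,) scaled dot products
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over sentences
    context_vec = weights @ context_reps         # (d,) weighted combination
    return context_vec, weights

# toy example: 5 contextual sentences, 8-dimensional representations
rng = np.random.default_rng(0)
reps = rng.standard_normal((5, 8))
q = rng.standard_normal(8)
vec, w = context_attention(q, reps)
print(w.sum())  # attention weights sum to 1
```

In the paper's setting, the resulting context vector would condition the Tacotron decoder alongside the usual phoneme-level attention; here it is shown in isolation.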

(1) Demos of different models

The Blizzard Challenge 2019 dataset was adopted in our experiments, and a WaveNet vocoder was used to reconstruct waveforms. The following demos present natural speech and the speech generated by the baseline model, the sentence-window (SW) models, and the paragraph-level (PL) model on the test set.
1. 都是这种情况,他们很难顺畅地读书,但是事业都非常成功。 (English translation: It was the same in all these cases: they found it hard to read fluently, but their careers were all very successful.)
natural speech | Baseline | SW(K=0) | SW(K=1) | PL
2. 说这事啊,当时演讲现场是一片不太相信的表情,觉得这是小儿科的想象。 (English translation: Speaking of this, the audience at the lecture wore skeptical expressions, thinking it was a childish fantasy.)
natural speech | Baseline | SW(K=0) | SW(K=1) | PL
3. 这个时候丘吉尔说了一句名言,这不是结束甚至不是结束的开始,而可能只是开始的结束。 (English translation: At this moment Churchill made a famous remark: this is not the end, it is not even the beginning of the end, but it may be just the end of the beginning.)
natural speech | Baseline | SW(K=0) | SW(K=1) | PL
4. 因为这件事的长期影响力就要开始发挥了嘛。 (English translation: Because the long-term influence of this event was about to start taking effect.)
natural speech | Baseline | SW(K=0) | SW(K=1) | PL

(2) Comparing different contexts in the SW(K=1) model

In Section 4.4, to further analyze how the SW(K=1) model utilizes contextual sentences, three different kinds of context input were compared experimentally. The sentence in red is the target sentence, and the sentences in black are contextual sentences. The following demos correspond to Fig. 2 and Fig. 3 in the paper.

1) demos

1. true contexts: 规矩的作用啊本质上是降低人和人之间连接的成本的。那什么时候有机会打破规矩呢,就是三种情况啊,第一种,连接的成本已经很低,啊比如在家里,两口子之间立的那些规矩,他就不容易长期保持,所谓清官难断家务事也就是这个原因。 (English translation: The role of rules is essentially to reduce the cost of connection among people. When will there be a chance to break the rules? There are three situations. First, the cost of connection is already very low. For example, the rules established between husband and wife at home are not easy to maintain for long. This is why, as the saying goes, even an upright official finds it hard to settle a family quarrel.)
mismatched contexts: 但伊凡的一番话让格伦放了心,并给安娜一本魔法书,格伦放心的把安娜交给伊凡,安娜加入队伍。那什么时候有机会打破规矩呢,就是三种情况啊,第一种,连接的成本已经很低,其实,这些语法掌握了也有助于阅读,用在作文中也同样精彩。 (English translation: But Evan's words reassured Glen, and Evan gave Anna a magic book; Glen confidently entrusted Anna to Evan, and Anna joined the team. When will there be a chance to break the rules? There are three situations. First, the cost of connection is already very low. In fact, mastering these grammar points also helps with reading, and they work just as well in composition.)
random contexts: 但以的好然果起容止人子怕本一差其但代没。益才了论上们这仄吃到必要物。有。那什么时候有机会打破规矩呢,就是三种情况啊,第一种,连接的成本已经很低,骚,门业时话斯企看一广反它年啊有等因利具被个了识专无。到了是专奈个都不。 (English translation: [random character sequences] When will there be a chance to break the rules? There are three situations. First, the cost of connection is already very low. [random character sequences])
natural speech | true contexts | mismatched contexts | random contexts

2) figures

The following figures show the average attention probabilities for one sentence, corresponding to Fig. 2 in the paper. In these figures, the attention probabilities are aligned one-to-one with their corresponding characters.
(a) true contexts
(b) mismatched contexts
(c) random contexts
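How such per-character averages can be obtained from a decoder attention matrix can be sketched as follows; the function name and the toy matrix are illustrative assumptions, not the paper's code.

```python
import numpy as np

def average_attention_per_char(attn_matrix):
    """Average attention probability that each context character receives.

    attn_matrix: (n_decoder_steps, n_context_chars) array in which each
                 row is an attention distribution over context characters
                 (rows sum to 1).
    Returns a (n_context_chars,) vector averaged over decoder steps.
    """
    return attn_matrix.mean(axis=0)

# toy example: 3 decoder steps attending over 4 context characters
attn = np.array([[0.7, 0.1, 0.1, 0.1],
                 [0.2, 0.5, 0.2, 0.1],
                 [0.1, 0.1, 0.6, 0.2]])
avg = average_attention_per_char(attn)
print(avg)  # one averaged probability per context character
```

Since every row of the attention matrix sums to 1, the averaged vector also sums to 1, so it can be read directly as the share of attention each context character attracts over the whole utterance.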