4.6.2. Results
Figure 3 shows boxplots of listeners' responses in terms of the rank order of systems for the female (top) and male (bottom) voices of vocoded (left) and synthetic (right) speech. The rank order was obtained per screen and per listener according to the scores given to each voice. The solid and dashed lines show median and mean values. To test for significant differences we used a Mann-Whitney U test at a p-value of 0.01 with a Holm-Bonferroni correction, owing to the large number of pairs to compare. Pairs that were not found to be significantly different from each other are connected with horizontal lines at the top of each boxplot.
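For clarity, the pairwise testing procedure can be sketched in Python as follows. This is a minimal illustration rather than the evaluation code used here: the function name and the `ranks_per_system` structure are hypothetical, while `mannwhitneyu` is the standard test from scipy.stats.

```python
from itertools import combinations
from scipy.stats import mannwhitneyu

def significant_pairs(ranks_per_system, alpha=0.01):
    """ranks_per_system: dict mapping system name -> per-listener ranks."""
    pairs = list(combinations(sorted(ranks_per_system), 2))
    pvals = [mannwhitneyu(ranks_per_system[a], ranks_per_system[b],
                          alternative='two-sided').pvalue
             for a, b in pairs]
    # Holm-Bonferroni: visit p-values in ascending order and compare the
    # k-th smallest (k = 0, 1, ...) against alpha / (m - k); stop rejecting
    # at the first p-value that fails the test.
    order = sorted(range(len(pvals)), key=pvals.__getitem__)
    rejected = set()
    for k, i in enumerate(order):
        if pvals[i] > alpha / (len(pvals) - k):
            break
        rejected.add(pairs[i])
    return rejected  # pairs judged significantly different
```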
As expected, natural speech ranked highest and noise ranked lowest in all cases. RNN-DFT was rated highest among all enhancement strategies in all cases. The gap between clean and RNN-DFT enhanced speech is smaller for the synthetic speech style than for the vocoded speech. In fact, for both genders the synthetic voice trained with RNN-DFT enhanced speech was not found to be significantly different from the voice built with clean speech. The increasing order of preference of the methods seems to be the same for vocoded and synthetic speech: OMLSA, followed by RNN-V and RNN-DFT. The benefit of the RNN-based methods is seen in both vocoded and synthetic voices, while the improvement from the OMLSA method seems to decrease after TTS acoustic model training.
5. Discussion
We have found that the reconstruction process required by the RNN-DFT method does not seem to negatively impact the extraction of TTS acoustic features from noisy data. However, we observed that the RNN-DFT method increases both MCEP and BAP distortion more than the RNN-V method. The assumption that phase can be reconstructed directly from the noisy speech data may have caused this increase in distortion. RNN-DFT nevertheless seems to decrease V/UV and F0 errors when compared to RNN-V. This is somewhat unexpected, as the RNN-V approach directly enhances the F0 data. Both methods decreased MCEP distortion for all noises tested, narrowing the gap between non-stationary and stationary noises.
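The MCEP distortion discussed above is conventionally measured as Mel cepstral distortion (MCD) in dB. A minimal sketch of the standard computation, assuming time-aligned MCEP matrices and the usual convention of excluding the 0th (energy) coefficient:

```python
import numpy as np

def mel_cepstral_distortion(mcep_ref, mcep_test):
    """Frame-averaged Mel cepstral distortion in dB.

    mcep_ref, mcep_test: (frames, dims) arrays of time-aligned MCEP
    vectors. The 0th coefficient (energy) is excluded by convention.
    """
    diff = mcep_ref[:, 1:] - mcep_test[:, 1:]
    frame_dist = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return (10.0 / np.log(10.0)) * np.mean(frame_dist)
```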
We argued in [12] that enhancing the acoustic parameters that are used for TTS model training should generate higher quality synthetic voices, but the subjective scores showed that RNN-DFT resulted in higher quality vocoded and synthetic speech for both genders. The RNN-DFT enhanced synthetic voice was in fact ranked as high as the voice built using clean data. We believe that RNN-V did not work as well because enhancing the F0 trajectory directly is quite challenging, as F0 extraction errors can be substantial in some frames (doubling and halving errors) while small in others.
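As an illustration of the doubling and halving errors mentioned above, one simple way to count them is to examine the ratio of F0 values on frames that both contours mark as voiced. A hypothetical sketch (the tolerance value is an assumption, not a setting from this work):

```python
import numpy as np

def octave_error_rate(f0_ref, f0_test, tol=0.2):
    """Fraction of commonly voiced frames whose F0 ratio is near 2 or 1/2.

    f0_ref, f0_test: per-frame F0 in Hz, with 0 marking unvoiced frames.
    tol: relative tolerance around the octave ratios (illustrative choice).
    """
    voiced = (f0_ref > 0) & (f0_test > 0)
    if not voiced.any():
        return 0.0
    ratio = f0_test[voiced] / f0_ref[voiced]
    doubling = np.abs(ratio - 2.0) < 2.0 * tol
    halving = np.abs(ratio - 0.5) < 0.5 * tol
    return float((doubling | halving).mean())
```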
6. Conclusion
We presented in this paper two different speech enhancement methods to improve the quality of TTS voices trained with noisy speech data. Both methods employ a recurrent neural network (RNN) to map noisy acoustic features to clean features. In one method we train the RNN on the acoustic features that are used to train TTS models, including the fundamental frequency and Mel cepstral coefficients. In the other method the RNN is trained on parameters extracted from the magnitude spectrum, as is usually done in conventional speech enhancement methods. For waveform reconstruction the phase information is obtained directly from the original noisy signal, while the magnitude spectrum is obtained from the output of the RNN. We found that, although its Mel cepstral distortion is higher, the second method was rated higher in quality for both vocoded and synthetic speech and for both the female and male data. The synthetic voices trained with data enhanced by this method were rated similarly to voices trained with clean speech. In the future we would like to investigate whether similar improvements would apply to voices trained using DNNs and whether training an RNN directly on the magnitude spectrum could further improve results.
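The waveform reconstruction step described above, which combines the enhanced magnitude spectrum with the phase of the noisy input, can be sketched as follows. The STFT configuration and function names are illustrative assumptions rather than the actual settings of the system:

```python
import numpy as np
from scipy.signal import stft, istft

def reconstruct(noisy, enhanced_magnitude, fs, nperseg=512):
    """Rebuild a waveform from an enhanced magnitude and the noisy phase.

    noisy: noisy time-domain signal.
    enhanced_magnitude: (freq, frames) magnitude output of the enhancement
        RNN, assumed to match the STFT geometry used here.
    """
    _, _, noisy_spec = stft(noisy, fs=fs, nperseg=nperseg)
    phase = np.angle(noisy_spec)                  # keep the noisy phase
    enhanced_spec = enhanced_magnitude * np.exp(1j * phase)
    _, waveform = istft(enhanced_spec, fs=fs, nperseg=nperseg)
    return waveform
```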
Acknowledgements
This work was partially supported by EPSRC through Programme Grant EP/I031022/1 (NST) and EP/J002526/1 (CAF), and by CREST from the Japan Science and Technology Agency (uDialogue project). The full NST research data collection may be accessed at http://hdl.handle.net/10283/786.
7. References
[1] H. Zen, K. Tokuda, and A. W. Black, "Statistical parametric speech synthesis," Speech Comm., vol. 51, no. 11, pp. 1039–1064, 2009.
[2] J. Yamagishi, Z. Ling, and S. King, "Robustness of HMM-based speech synthesis," in Proc. Interspeech, Brisbane, Australia, Sep. 2008, pp. 581–584.
[3] J. Yamagishi, C. Veaux, S. King, and S. Renals, "Speech synthesis technologies for individuals with vocal disabilities: Voice banking and reconstruction," J. of Acoust. Science and Tech., vol. 33, no. 1, pp. 1–5, 2012.
[4] Y. Hu and P. C. Loizou, "Subjective comparison of speech enhancement algorithms," in Proc. ICASSP, vol. 1, May 2006, pp. I–I.
[5] Y. Wang and D. Wang, "A deep neural network for time-domain signal reconstruction," in Proc. ICASSP, April 2015, pp. 4390–4394.
[6] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE Trans. on Audio, Speech and Language Processing, vol. 23, no. 1, pp. 7–19, Jan 2015.
[7] K. Kinoshita, M. Delcroix, A. Ogawa, and T. Nakatani, "Text-informed speech enhancement with deep neural networks," in Proc. Interspeech, Sep. 2015, pp. 1760–1764.
[8] F. Weninger, J. Hershey, J. Le Roux, and B. Schuller, "Discriminatively trained recurrent neural networks for single-channel speech separation," in Proc. GlobalSIP, Dec 2014, pp. 577–581.
[9] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller, "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR," in Proc. Int. Conf. Latent Variable Analysis and Signal Separation. Springer International Publishing, 2015, pp. 91–99.
[10] T. Toda and K. Tokuda, "A speech parameter generation algorithm considering global variance for HMM-based speech synthesis," IEICE Trans. Inf. Syst., vol. E90-D, no. 5, pp. 816–824, 2007.
[11] R. Karhila, U. Remes, and M. Kurimo, "Noise in HMM-based speech synthesis adaptation: Analysis, evaluation methods and experiments," J. Sel. Topics in Sig. Proc., vol. 8, no. 2, pp. 285–295, April 2014.
[12] C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, "Speech enhancement for a noise-robust text-to-speech synthesis system using deep recurrent neural networks," in Proc. Interspeech, (submitted) 2016.
[13] P. C. Loizou, Speech Enhancement: Theory and Practice, 1st ed. Boca Raton, FL, USA: CRC Press, 2007.
[14] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Trans. on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.