Invited speakers

Modeling speech intelligibility in adverse conditions

Torsten Dau, Technical University of Denmark (Slides)

In everyday life, the speech we listen to is often mixed with many other sound sources as well as reverberation. In such situations, people with normal hearing are able to almost effortlessly segregate a single voice out of the background. In contrast, hearing-impaired people have great difficulty understanding speech when more than one person is talking, even when reduced audibility has been fully compensated for by a hearing aid. The reasons for these difficulties are not well understood. This presentation highlights recent concepts of the monaural and binaural signal processing strategies employed by the normal as well as impaired auditory system. Jørgensen and Dau [(2011). J. Acoust. Soc. Am. 130, 1475-1487] proposed the speech-based envelope power spectrum model (sEPSM) in an attempt to overcome the limitations of the classical speech transmission index (STI) and speech intelligibility index (SII) in conditions with nonlinearly processed speech. Instead of considering the reduction of the temporal modulation energy as the intelligibility metric, as assumed in the STI, the sEPSM applies the signal-to-noise ratio in the envelope domain (SNRenv). This metric was shown to be the key for predicting the intelligibility of reverberant speech as well as noisy speech processed by spectral subtraction. However, the sEPSM cannot account for speech subjected to phase jitter, a condition in which the spectral structure of speech is destroyed, while the broadband temporal envelope is kept largely intact. In contrast, the effects of this distortion can be predicted successfully by the spectro-temporal modulation index (STMI) [Elhilali et al., (2003). Speech Commun. 41, 331-348], which assumes an explicit analysis of the spectral modulation energy. However, since the STMI applies the same decision metric as the STI, it fails to account for spectral subtraction. The results from the different modeling approaches suggest that the SNRenv might be a key decision metric while some explicit across-frequency pre-processing seems crucial to extract relevant speech features in some conditions.
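To make the decision metric concrete, here is a minimal, illustrative sketch of an SNRenv-style computation, not the published sEPSM: it extracts Hilbert envelopes of the noisy speech and of the noise alone, measures normalised envelope power in a few modulation-frequency bands, estimates the speech envelope power as mixture minus noise, and combines the per-band envelope SNRs. The band choices, envelope extraction and combination rule are simplifying assumptions made for the example.

```python
# Illustrative SNRenv-style metric (not the published sEPSM implementation).
import numpy as np
from scipy.signal import hilbert

def envelope(x):
    """Hilbert envelope of a (band-limited) signal."""
    return np.abs(hilbert(x))

def band_env_power(env, fs, lo, hi):
    """Normalised envelope power within one modulation-frequency band,
    estimated from the DFT of the mean-normalised envelope."""
    e = env / (env.mean() + 1e-12) - 1.0
    spec = np.abs(np.fft.rfft(e)) ** 2 / len(e) ** 2
    freqs = np.fft.rfftfreq(len(e), 1.0 / fs)
    band = (freqs >= lo) & (freqs < hi)
    return 2.0 * spec[band].sum()

def snr_env(noisy_speech, noise, fs, mod_bands=((1, 2), (2, 4), (4, 8), (8, 16))):
    """Combine per-band envelope SNRs; speech envelope power is estimated as
    mixture envelope power minus noise envelope power (floored above zero)."""
    env_mix, env_noise = envelope(noisy_speech), envelope(noise)
    snr_sq = 0.0
    for lo, hi in mod_bands:
        p_mix = band_env_power(env_mix, fs, lo, hi)
        p_noise = band_env_power(env_noise, fs, lo, hi)
        p_speech = max(p_mix - p_noise, 1e-6)
        snr_sq += (p_speech / (p_noise + 1e-12)) ** 2
    return np.sqrt(snr_sq)  # higher values predict higher intelligibility
```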

The effect of speaker-listener interaction on speech production in adverse listening conditions

Valerie Hazan, Dept of Speech, Hearing and Phonetic Sciences, UCL, London, UK (Slides)

Speakers interacting with interlocutors who are having difficulty understanding them due to a hearing impairment or adverse listening conditions need to adapt their speech to maintain effective communication despite the adverse environment. At the same time, according to Lindblom's Hyper-Hypo model of speech production (Lindblom, 1990), talkers will tend to keep articulatory effort to the minimum level needed for effective communication. In our recent study, we investigated the adaptations that speakers make in such situations. The LUCID corpus (Baker and Hazan, 2011) includes dialogs produced by 40 speakers of SSBE while resolving a set of 'spot the difference' picture tasks in different communicative conditions. In some, the two talkers carrying out the task could hear each other normally, while in others one talker heard the other via (a) a three-channel noise-excited vocoder (i.e. a simulation of a cochlear implant), (b) multibabble noise or (c) a language barrier (L2 speaker). Crucially, we analysed the speech of the talker in the pair who was hearing normally, but who had to adapt their speech in response to the listening difficulties of their interlocutor. Analyses were made of global and segmental acoustic-phonetic measures as well as measures of lexical variety and communication efficiency. The communication barriers elicited perceptually clearer speech in the talker not directly experiencing the interference, and the adaptations made varied with the type of communication barrier that the interlocutor was experiencing. Correlations in clarity ratings for samples of spontaneous speech produced in the different conditions suggest that talkers' ranking in terms of their inherent clarity persists across speaking styles. However, weak correlations between acoustic-phonetic measures and measures of communication efficiency in adverse conditions suggest that talkers used a range of strategies to clarify their speech. These data provide further evidence that speech production is finely attuned to the needs of the interlocutor and that much is to be gained by analysing speech produced with communicative intent.

Speech Intelligibility Improvement using a Perceptual Distortion Measure

Richard Heusdens, Delft University of Technology (Slides)

In this talk we present a speech pre-processing algorithm to improve speech intelligibility in noise for the near-end listener. The algorithm improves intelligibility by optimally redistributing the speech energy over time and frequency with respect to a perceptual distortion measure, which is based on a spectro-temporal auditory model.

Perceptual models exploiting auditory masking are frequently used in audio and speech processing applications such as coding and watermarking. In most cases, these models only take into account spectral masking in short-time frames. As a consequence, undesired audible artifacts in the temporal domain may be introduced (e.g., pre-echoes). In this talk we discuss a new low-complexity spectro-temporal distortion measure. The model facilitates the computation of analytic expressions for masking thresholds, whereas advanced spectro-temporal models typically need computationally demanding adaptive procedures to estimate these thresholds. We show that the proposed method gives masking predictions similar to those of an advanced spectro-temporal model at only a fraction of the computational cost. This auditory model can be used to improve speech intelligibility. Since it takes short-time information into account, transients receive more amplification than stationary vowels, which is beneficial for intelligibility in noise. The proposed method is compared to the noisy unprocessed speech and to two reference methods by means of an intelligibility listening test. The results show that the proposed method leads to a statistically significant improvement in speech intelligibility and, at the same time, to improved speech quality compared to the noisy speech.
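As a purely illustrative sketch of the energy-redistribution principle (not the algorithm or distortion measure presented in the talk), the following reallocates a fixed speech energy budget over time-frequency tiles so that as many tiles as possible end up above a noise-dependent audibility margin; the 3 dB margin, the greedy allocation and the tile-power representation are all assumptions made for the example.

```python
# Illustrative sketch only: redistribute a fixed speech energy budget across
# time-frequency tiles so that as much of the speech as possible exceeds a
# noise-dependent audibility threshold. Not the method from the talk.
import numpy as np

def redistribute_energy(speech_pow, noise_pow, margin_db=3.0):
    """speech_pow, noise_pow: (frames x bands) arrays of per-tile powers.
    Returns per-tile amplitude gains that preserve the total speech energy."""
    target = noise_pow * 10 ** (margin_db / 10)   # power needed to be "audible"
    budget = speech_pow.sum()                      # total energy to preserve
    # Greedy allocation: serve the cheapest tiles (smallest required power) first.
    flat_target = target.ravel()
    new_pow = np.zeros(flat_target.shape)
    remaining = budget
    for idx in np.argsort(flat_target):
        if flat_target[idx] > remaining:
            break
        new_pow[idx] = flat_target[idx]
        remaining -= flat_target[idx]
    new_pow = new_pow.reshape(speech_pow.shape)
    # Spread any leftover energy proportionally over the original speech power.
    new_pow += remaining * speech_pow / budget
    return np.sqrt(new_pow / (speech_pow + 1e-12))
```

A real system would weight the allocation by the perceptual distortion measure rather than by a fixed audibility margin; the sketch only illustrates the constraint that the total speech energy stays unchanged while its distribution over time and frequency changes.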

Biography: Richard Heusdens
Richard Heusdens received the M.Sc. and Ph.D. degrees from Delft University of Technology, Delft, The Netherlands, in 1992 and 1997, respectively. In the spring of 1992, he joined the digital signal processing group at the Philips Research Laboratories, Eindhoven, The Netherlands. He has worked on various topics in the field of signal processing, such as image/video compression and VLSI architectures for image processing algorithms. In 1997, he joined the Circuits and Systems Group of Delft University of Technology as a Postdoctoral Researcher. In 2000, he moved to the Information and Communication Theory (ICT) Group, where he became an Assistant Professor responsible for the audio/speech signal processing activities within the group. Since 2002, he has been an Associate Professor in the Department of Intelligent Systems, Delft University of Technology. He has held visiting positions at KTH (Royal Institute of Technology, Sweden) in 2002 and 2008, and is involved in research projects covering subjects such as audio and speech coding, speech enhancement, signal processing for digital hearing aids, distributed signal processing and sensor networks.

Articulation in the presence of noise

James Johnston, DTS Inc.

This talk will introduce the basic issues in articulation (i.e. understanding speech) in the presence of noise. First, the presentation will provide a quick overview of the psychology, pointing out differences in articulation that can arise due to attention or expectation. Then, a brief treatment of auditory masking, combined with auditory filtering, will explain what is necessary to get the peaks in human speech (the formants) above the masking level in a fashion that the auditory system and brain can extract them from the background. Finally, a discussion of binaural hearing, and of how onsets help disambiguate sounds in the presence of noise, will complete the talk. Due to the broad nature of the subject, the talk will be pitched at a conceptual level rather than going into extensive detail.

A new dimension of voice quality manipulation

Hideki Kawahara, Wakayama University (Slides)

Voice quality plays important roles in communication. It provides additional communication channels for non- and paralinguistic information. Extending a multi-aspect, temporally variable morphing framework based on TANDEM-STRAIGHT with new texture-related parameters enables modification of these communication channels while preserving the naturalness of the original speech. This talk starts with a brief summary of the current state of STRAIGHT and then introduces finer analysis and synthesis procedures for excitation-source-related parameters, which form the basis of this extension. Signal processing aspects, such as a new instantaneous frequency representation, F0 extraction with higher temporal resolution and a parametric representation of aperiodic components, together with higher-level statistical aspects such as sound texture, provide the basis and motivation for this extension.
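As a generic illustration of the kind of instantaneous-frequency processing mentioned above (not the TANDEM-STRAIGHT algorithm itself), the following estimates instantaneous frequency from the phase derivative of the analytic signal; the sinusoidal test glide and the sampling rate are arbitrary choices for the example.

```python
# Generic instantaneous-frequency estimate from the analytic-signal phase
# derivative; not the TANDEM-STRAIGHT method, just the underlying idea.
import numpy as np
from scipy.signal import hilbert

def instantaneous_frequency(x, fs):
    """Instantaneous frequency (Hz) from the unwrapped analytic-signal phase."""
    phase = np.unwrap(np.angle(hilbert(x)))
    return np.diff(phase) * fs / (2 * np.pi)

# Example: a 100 Hz -> 120 Hz glide is recovered sample by sample.
fs = 16000
t = np.arange(fs) / fs
f0 = 100 + 20 * t                          # linearly rising "F0"
x = np.sin(2 * np.pi * np.cumsum(f0) / fs)
f0_est = instantaneous_frequency(x, fs)    # tracks f0 closely away from the edges
```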

Some insights into talker-listener-environment coupling, energetics and the contrastive particulate structure of spoken language

Roger Moore, University of Sheffield (Slides | Audio)

Back in the early 1980s, my small research group at the Royal Signals and Radar Establishment (consisting of myself, Martin Russell and Mike Tomlinson) were investigating advanced forms of dynamic time warping (DTW). In particular, we were studying the detailed spectro-temporal relationships revealed by DTW between different versions of the same and contrasting utterances, and we came up with two key techniques for modelling the patterning that we observed: 'timescale variability analysis' (TVA) and 'discriminative networks' (DNs). TVA was a forerunner of what became widely known as 'duration modelling', and DNs were effectively a very early form of sub-word modelling. All of the research was conducted using speech that had been parameterised using the front-end of a military-specification channel vocoder (effectively a 27 channel filter bank). The vocoder not only provided the advantage of real-time speech analysis (so we were able to build real-time ASR systems), but it also offered the bonus that any manipulated speech patterns could be replayed through the channel vocoder synthesiser – and that is what we did on a regular basis, not to generate speech per se, but simply to understand the pattern structures that were embedded in our early statistical models. We were therefore quite surprised when we discovered that we could vary the generated output along various continua automatically such that one word could be transformed to sound like another or, much more interestingly, that a word could be transformed to sound less like another, and that the latter manipulations sounded clearer (as if the speaker was making more effort)! We had, of course, stumbled across a practical demonstration of what Bjorn Lindblom would subsequently publish as his theory of hyper and hypo speech (H&H).
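For readers unfamiliar with DTW, here is a minimal sketch of the basic alignment recursion between two sequences of spectral frames (for example, filter-bank vectors); the systems described above added slope constraints, timescale-variability modelling and discriminative networks on top of this core idea.

```python
# Minimal dynamic time warping between two sequences of spectral frames.
import numpy as np

def dtw_cost(a, b):
    """a: (n, d) and b: (m, d) arrays of frames; returns the accumulated
    cost of the best monotonic alignment between them."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```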

Since those early days, I have held a continuing belief that talkers actively manage their speech production to suit the communicative context (including the listener(s) and the environment), and that such teleological behaviour was the source of much unexplained variability. It is for this reason that Lindblom's H&H theory figures strongly in my 'predictive sensorimotor control and emulation' (PRESENCE) model of speech in which I introduce the concept of 'reactive speech synthesis' – a synthesiser that dynamically adjusts its output as a function of the perceived effect on the listener.

In this talk I will discuss my current thinking in this area, touching on Robin Hofe's investigation into H&H using 'AnTon' (his animatronic tongue and vocal tract) and Mauro Nicolao's research into 'speech synthesis by analysis', but also speculating about (i) the wider implications of dynamic coupling between talkers, listeners and their communicative environments, (ii) the fundamental role that energetics plays in conditioning the behaviour of living systems, and (iii) the special consequences for the evolution of a high information-rate low degree-of-freedom system such as spoken language.

  • Russell, M. J., Moore, R. K., & Tomlinson, M. J. (1983). Some techniques for incorporating local timescale variability information into a dynamic time warping algorithm for automatic speech recognition. Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Boston.
  • Moore, R. K., Russell, M. J., & Tomlinson, M. J. (1983). The discriminative network: a mechanism for focusing recognition in whole-word pattern matching. Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Boston.
  • Moore, R. K. (2007). Spoken language processing: piecing together the puzzle. Speech Communication, 49, 418-435.
  • Moore, R. K. (2007). PRESENCE: A human-inspired architecture for speech-based human-machine interaction. IEEE Trans. Computers, 56(9), 1176-1188.
  • Hofe, R., & Moore, R. K. (2008). Towards an investigation of speech energetics using 'AnTon': an animatronic model of a human tongue and vocal tract. Connection Science, 20(4), 319-336.
  • Moore, R. K., & Nicolao, M. (2011). Reactive speech synthesis: actively managing phonetic contrast along an H&H continuum. Proc. 17th International Congress of Phonetic Sciences (ICPhS), Hong Kong.

An integrated theory of language production and comprehension

Martin Pickering, University of Edinburgh

Current accounts of language processing treat production and comprehension as quite distinct. I reject this dichotomy. In its place, I propose that producing and understanding are tightly interwoven, and this interweaving underlies people's ability to predict themselves and each other. Based on accounts of action, action perception, and joint action in which action and perception are interwoven to support prediction, I develop analogous accounts of production, comprehension and interactive language. Specifically, I propose that people predict their own utterances at different levels of representation (semantics, syntax, and phonology), and that they covertly imitate and predict their partner's utterances.

Listening Enhancement for Mobile Phones - How to Improve the Intelligibility in a Noisy Environment

Bastian Sauert and Peter Vary, RWTH Aachen University, Germany (Slides)

Mobile telephony is often conducted in the presence of strong acoustical background noise such as traffic or babble noise. In this situation, the near-end listener perceives a mixture of the clean far-end (downlink) speech and the acoustical background noise from the near end, and thus experiences increased listening effort and possibly reduced speech intelligibility.

While the acoustical background noise signal cannot be influenced, the received clean far-end speech signal can be manipulated by signal processing techniques to reduce the listening effort and to improve speech intelligibility. We call this approach near-end listening enhancement.

A reasonable objective optimization criterion is to maximize the Speech Intelligibility Index (SII). The optimization has to take into account constraints arising from the underlying psychoacoustical model of perception and from the limitations of small loudspeakers. The optimization approach and its solutions will be presented.
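As a heavily simplified illustration of this kind of constrained optimization (not the solution presented in the talk), the sketch below scores band-gain allocations with a simplified SII-style measure, i.e. a band-importance-weighted audibility with the SNR clipped to a 30 dB range, and searches randomly over allocations with a fixed total speech power; the band weights, clipping range and random search are assumptions made for the example.

```python
# Heavily simplified sketch: pick per-band speech powers that maximise a
# simplified SII-style score under a fixed total speech power.
# Not the optimization or psychoacoustic model presented in the talk.
import numpy as np

def sii_like(speech_pow, noise_pow, weights):
    """Band-importance-weighted audibility; each band's SNR is clipped to
    [-15, +15] dB and mapped linearly onto [0, 1]."""
    snr_db = 10 * np.log10(speech_pow / noise_pow)
    audibility = np.clip((snr_db + 15) / 30, 0.0, 1.0)
    return float(np.sum(weights * audibility))

def best_reallocation(speech_pow, noise_pow, weights, n_trials=20000, seed=0):
    """Random search over per-band power allocations with unchanged total power."""
    rng = np.random.default_rng(seed)
    total = speech_pow.sum()
    best_pow = speech_pow
    best_score = sii_like(speech_pow, noise_pow, weights)
    for _ in range(n_trials):
        alloc = rng.dirichlet(np.ones(len(speech_pow))) * total
        score = sii_like(alloc, noise_pow, weights)
        if score > best_score:
            best_pow, best_score = alloc, score
    return best_pow, best_score
```

An analytic or gradient-based solution, with the loudspeaker and psychoacoustic constraints built in, would replace the random search in practice.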

Alternative time-domain and frequency-domain implementation structures with uniform and non-uniform spectral resolution will be discussed. The experimental setup using a dummy head will be described. Audio examples will be demonstrated.

Furthermore, the applicability to digital hearing aids, car radios, and in-car communication systems will be addressed.

HMM-based Speech Synthesis Adapted to Listeners' and Talkers' Conditions

Junichi Yamagishi, University of Edinburgh (Slides)

It is known that the intelligibility of synthetic speech generated by state-of-the-art hidden Markov models (HMMs) can be comparable to that of natural speech in clean environments. However, the situation is quite different if the listener's and/or talker's conditions differ. If the listener's environment is noisy, natural speech is most often still more intelligible than synthetic speech. If the talker's condition is disordered due to vocal disabilities such as neurological degenerative diseases, the talker's speech may be unintelligible even in clean environments.

In this talk, we introduce our recent approaches to these problems. To improve the intelligibility of synthetic speech in noise, we have proposed two promising approaches based on statistical modelling and signal processing. In the statistical modelling approach, we use speech waveforms and articulatory movements recorded in parallel by electromagnetic articulography and try to create hyper-articulated speech from normal speech by manipulating the articulatory movements predicted from the HMM [1]. The signal processing approach is a new cepstral analysis and transformation method [2] based on an objective intelligibility measure for speech in noise, the Glimpse Proportion measure [3]. This method modifies the spectral envelope of the clean speech in order to increase its intelligibility in noise. Finally, we mention other work in which we create natural and intelligible synthetic voices even from the disordered, unintelligible speech of individuals suffering from motor neurone disease [4].
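As a minimal illustration of the Glimpse Proportion idea from [3], the sketch below counts the fraction of spectro-temporal cells in which the speech locally exceeds the masker by a fixed margin; it uses an STFT and a 3 dB criterion for simplicity, rather than the auditory filterbank and the exact measure used in [2] and [3].

```python
# Minimal illustration of a glimpse-proportion-style measure: the fraction of
# time-frequency cells where speech locally exceeds the noise by a margin.
# Uses an STFT for simplicity; [2, 3] use an auditory front end.
import numpy as np
from scipy.signal import stft

def glimpse_proportion(speech, noise, fs, threshold_db=3.0):
    """Fraction of STFT cells with local speech-to-noise ratio above threshold_db."""
    _, _, S = stft(speech, fs=fs, nperseg=512)
    _, _, N = stft(noise, fs=fs, nperseg=512)
    local_snr_db = 20 * np.log10((np.abs(S) + 1e-12) / (np.abs(N) + 1e-12))
    return float(np.mean(local_snr_db > threshold_db))
```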

  • [1] Z.-H. Ling, K. Richmond, J. Yamagishi, and R.-H. Wang, "Integrating Articulatory Features into HMM-based Parametric Speech Synthesis," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 6, pp. 1171-1185, August 2009.
  • [2] C. Valentini-Botinhao, R. Maia, J. Yamagishi, S. King, and H. Zen, "Cepstral analysis based on the Glimpse proportion measure for improving the intelligibility of HMM-based synthetic speech in noise," Proc. ICASSP, 2012.
  • [3] M. Cooke, "A glimpsing model of speech perception in noise," J. Acoust. Soc. Am., vol. 119, no. 3, pp. 1562-1573, 2006.
  • [4] J. Yamagishi, C. Veaux, S. King and S. Renals, "Speech synthesis technologies for individuals with vocal disabilities: voice banking and reconstruction," invited review, Acoustical Science & Technology, vol. 33, pp. 1-5, January 2012. http://www.jstage.jst.go.jp/browse/ast/33/1/_contents