Otto-von-Guericke-Universität Magdeburg


Eurospeech 2003 Geneva

Eurospeech Sept. 1-4, 2003 Geneva
Selected papers (by attendees MS and AW) Oct 29, 2003

Talks (with presentation material, 20 min plus discussion) on the papers and their background will be given by the KO members listed below (your acronym appears next to the corresponding abstracts). Thu Nov 13, 09:15h: SK, MS, MK, AW and Thu Nov 20, 09:15h: Mai, SB, Mam, EA
Selection MS:
OMoCa - Aurora Noise Robustness on SMALL Vocabulary Databases (selected by AW)


Veronique Stouten, Katholieke Universiteit Leuven, Belgium

Hugo Van hamme, Katholieke Universiteit Leuven, Belgium

Kris Demuynck, Katholieke Universiteit Leuven, Belgium

Patrick Wambacq, Katholieke Universiteit Leuven, Belgium

Page: 17-20, Paper Number: 1110

Abstract: Maintaining a high level of robustness for Automatic Speech Recognition (ASR) systems is especially challenging when the background noise has a time-varying nature. We have implemented a Model-Based Feature Enhancement (MBFE) technique that not only can easily be embedded in the feature extraction module of a recogniser, but also is intrinsically suited for the removal of non-stationary additive noise. To this end we combine statistical models of the cepstral feature vectors of both clean speech and noise, using a Vector Taylor Series approximation in the power spectral domain. Based on this combined HMM, a global MMSE estimate of the clean speech is then calculated. Because of the scalability of the applied models, MBFE is flexible and computationally feasible. Recognition experiments with this feature enhancement technique on the Aurora2 connected digit recognition task showed significant improvements in the noise robustness of the HTK recogniser.


OMoCc - Speech Signal Processing I


A technique for high-accuracy tracking of formants or vocal tract resonances is presented in this paper using a novel nonlinear predictor and using a target-directed temporal constraint. The nonlinear predictor is constructed from a parameter-free, discrete mapping function from the formant (frequencies and bandwidths) space to the LPC-cepstral space, with trainable residuals. We examine in this study the key role of vocal tract resonance targets in the tracking accuracy. Experimental results show that due to the use of the targets, the tracked formants in the consonantal regions (including closures and short pauses) of the speech utterance exhibit the same dynamic properties as for the vocalic regions, and reflect the underlying vocal tract resonances. The results also demonstrate the effectiveness of training the prediction-residual parameters and of incorporating the target-based constraint in obtaining high-accuracy formant estimates, especially for non-sonorant portions of speech.

SMoDa - Aurora Noise Robustness on LV Databases


In this paper, we analyze the results of the recent Aurora large vocabulary evaluations. Two consortia submitted proposals on speech recognition front ends for this evaluation: (1) Qualcomm, ICSI, and OGI (QIO), and (2) Motorola, France Telecom, and Alcatel (MFA). These front ends used a variety of noise reduction techniques including discriminative transforms, feature normalization, voice activity detection, and blind equalization. Participants used a common speech recognition engine to postprocess their features. In this paper, we show that the results of this evaluation were not significantly impacted by suboptimal recognition system parameter settings. Without any front end specific tuning, the MFA front end outperforms the QIO front end by 9.6% relative. With tuning, the relative performance gap increases to 15.8%. Both the mismatched microphone and additive noise evaluation conditions resulted in a significant degradation in performance for both front ends.


This paper is mainly focused on showing experimental results of a feature extraction algorithm that combines spectral noise reduction and nonlinear feature normalization. The success of this approach was shown in previous work; here we present several improvements that result in a performance comparable to that of the recently approved AFE for DSR. Noise reduction is now based on a Wiener filter instead of spectral subtraction. The voice activity detection based on the full-band energy has been replaced with a new one using spectral information. Relative improvements of 24.81% and 17.50% over our previous system are obtained for AURORA 2 and 3 respectively. Results for AURORA 2 are not as good as those for the AFE, but for AURORA 3 a relative improvement of 5.27% is obtained.

PMoDe - Speech Modeling and Features I


Linear discriminant analysis (LDA) in its original model-free formulation is best suited to classification problems with equal-covariance classes. Heteroscedastic discriminant analysis (HDA) removes this equal covariance constraint, and therefore is more suitable for automatic speech recognition (ASR) systems. However, maximizing the HDA objective function does not correspond directly to minimizing the recognition error. In its original formulation, HDA solves a maximum likelihood estimation problem in the original feature space to calculate the HDA transformation matrix. Since the dimension of the original feature space in ASR problems is usually high, the estimation of the HDA transformation matrix becomes computationally expensive and requires a large amount of training data. This paper presents a generalization of LDA that solves these two problems. We start by showing that the calculation of the LDA projection matrix is a maximum mutual information estimation problem in the lower-dimensional space with some constraints on the model of the joint conditional and unconditional probability density functions (PDF) of the features, and then, by relaxing these constraints, we develop a dimensionality reduction approach that maximizes the conditional mutual information between the class identity and the feature vector in the lower-dimensional space given the recognizer model. Using this approach, we achieved a 1% improvement in phoneme recognition accuracy compared to the baseline system. Improvement in recognition accuracy compared to both LDA and HDA approaches is also achieved.


Panu Somervuo, International Computer Science Institute, USA

Barry Chen, International Computer Science Institute, USA

Qifeng Zhu, International Computer Science Institute, USA

Page: 477-480, Paper Number: 439

Abstract: In this work, linear and nonlinear feature transformations have been tested in the ASR front end. Unsupervised transformations were based on principal component analysis and independent component analysis. Discriminative transformations were based on linear discriminant analysis and multilayer perceptron networks. The acoustic models were trained using a subset of HUB5 training data and they were tested using the OGI Numbers corpus. The baseline feature vector consisted of PLP cepstrum and energy with first and second order deltas. None of the feature transformations could outperform the baseline when used alone, but improvement in the word error rate was gained when the baseline feature was combined with the feature transformation stream. Two combination methods were tested: feature vector concatenation and n-best list combination using ROVER. Best results were obtained using the combination of the baseline PLP cepstrum and the feature transform based on multilayer perceptron network. The word error rate in the number recognition task was reduced from 4.1 to 3.1.
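The closing figures of this abstract can be read as an absolute and a relative reduction. The following is a small sketch of that arithmetic; the helper name is hypothetical, and it assumes the two quoted values are word error rates in percent.

```python
def wer_reduction(baseline_wer, new_wer):
    """Return (absolute, relative) WER reduction; relative is in percent."""
    absolute = baseline_wer - new_wer
    relative = 100.0 * absolute / baseline_wer
    return absolute, relative


# Values quoted in the abstract (assumed to be percent WER).
absolute, relative = wer_reduction(4.1, 3.1)
print(f"absolute: {absolute:.1f} points, relative: {relative:.1f}%")
# prints "absolute: 1.0 points, relative: 24.4%"
```

Relative figures of this kind are what the other abstracts in this list quote when they report "relative improvement".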


STuBb - Forensic Speaker Recognition


The most prominent part in forensic speech and audio processing is speaker recognition. Worldwide, a number of approaches to forensic speaker recognition (FSR) have been developed that differ in terms of technical procedures, methodology, instrumentation and also in terms of the probability scale on which the final conclusion is based. The BKA's approach to speaker recognition is a combination of classical phonetic analysis techniques including analytical listening by an expert and the use of signal processing techniques within an acoustic-phonetic framework. This combined auditory-instrumental method includes acoustic measurements of parameters which may be interpreted using statistical information on their distributions, e.g. probability distributions of average fundamental frequency for adult males and females, average syllable rates as indicators of speech rate, etc. In a voice comparison report the final conclusion is determined by a synopsis of the results from auditory and acoustic parameters, amounting to about eight to twelve on average, depending on the nature of the speech material. Results are given in the form of probability statements. The paper gives an overview of current procedures and specific problems of FSR.

PTuBg - Speech Modeling and Features II


In this paper, we explore modern methods and algorithms from fractal/chaotic systems theory for modeling speech signals in a multidimensional phase space and extracting characteristic invariant measures like  generalized fractal dimensions and  Lyapunov exponents. Such measures can capture valuable information for the characterisation of the multidimensional phase space -- which is closer to the true dynamics -- since they are sensitive to the frequency with which the attractor visits different regions and the rate of exponential divergence of nearby orbits, respectively. Further we examine the classification capability of related nonlinear features over broad phoneme classes. The results of these preliminary experiments indicate that the information carried by these novel nonlinear feature sets is important and useful.


Feature representation is a very important factor that has great effect on the performance of speech recognition systems. In this paper we focus on a feature generation process that is based on linear transformation of the original log-spectral representation. We first discuss three popular linear transformation methods, Mel-Frequency Cepstral Coefficients (MFCC), Principal Component Analysis (PCA), and Linear Discriminant Analysis (LDA). We then propose a new method of linear transformation that maximizes the normalized acoustic likelihood of the most likely state sequences of training data, a measure that is directly related to our ultimate objective of reducing Bayesian classification error rate in speech recognition. Experimental results show that the proposed method decreases the relative word error rate by more than 8.8% compared to the best implementation of LDA, and by more than 25.9% compared to MFCC features.

PTuBh - Topics in Speech Recognition & Segmentation


We address the problem of estimating the word error rate (WER) of an automatic speech recognition (ASR) system without using acoustic test data. This is an important problem which is faced by the designers of new applications which use ASR. A quick estimate of WER early in the design cycle can be used to guide the decisions involving dialog strategy and grammar design. Our approach involves estimating the probability distribution of the word hypotheses produced by the underlying ASR system given the text test corpus. A critical component of this system is a phonemic confusion model which seeks to capture the errors made by ASR on the acoustic data at a phonemic level. We use a confusion model composed of probabilistic phoneme sequence conversion rules which are learned from phonemic transcription pairs obtained by leave-one-out decoding of the training set. We show reasonably close estimation of WER when applying the system to test sets from different domains.

OTuCa - Robust Speech Recognition - Acoustic Modeling



In this paper, we describe an automatic speech recognition system where features extracted from the human speech production system in the form of articulatory movement data are effectively integrated in the acoustic model for improved recognition performance. The system is based on the hybrid HMM/BN model, which allows for easy integration of different speech features by modeling probabilistic dependencies between them. In addition, features like articulatory movements, which are difficult or impossible to obtain during recognition, can be left hidden, in fact eliminating the need for their extraction. The system was evaluated in a phoneme recognition task on a small database consisting of three speakers' data in speaker dependent and multi-speaker modes. In both cases, we obtained higher recognition rates compared to a conventional, spectrum based HMM system with the same number of parameters.


STuCb - Advanced Machine Learning Algorithms for Speech & Language Processing


Kernel methods have found in recent years wide use in statistical learning techniques due to their good performance and their computational efficiency in high-dimensional feature space. However, text or speech data cannot always be represented by the fixed-length vectors that the traditional kernels handle. We recently introduced a general kernel framework based on weighted transducers, rational kernels, to extend kernel methods to the analysis of variable-length sequences and weighted automata [5] and described their application to spoken-dialog applications. We presented a constructive algorithm for ensuring that rational kernels are positive definite symmetric, a property which guarantees the convergence of discriminant classification algorithms such as Support Vector Machines, and showed that many string kernels previously introduced in the computational biology literature are special instances of such positive definite symmetric rational kernels [4]. This paper reviews the essential results given in [5, 3, 4] and presents them in the form of a short tutorial.


Label sequence learning is the problem of inferring a state sequence from an observation sequence, where the state sequence may encode a labeling, annotation or segmentation of the sequence. In this paper we give an overview of discriminative methods developed for this problem. Special emphasis is put on large margin methods by generalizing multiclass Support Vector Machines and AdaBoost to the case of label sequences. An experimental evaluation demonstrates the advantages over classical approaches like Hidden Markov Models and the competitiveness with methods like Conditional Random Fields.


Nonnegativity constraints arise frequently in statistical learning and pattern recognition. Multiplicative updates provide natural solutions to optimizations involving these constraints. One well known set of multiplicative updates is given by the Expectation-Maximization algorithm for hidden Markov models, as used in automatic speech recognition. Recently, we have derived similar algorithms for nonnegative deconvolution and nonnegative quadratic programming. These algorithms have applications to low-level problems in voice processing, such as fundamental frequency estimation, as well as high-level problems, such as the training of large margin classifiers. In this paper, we describe these algorithms and the ideas that connect them.
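To illustrate what a multiplicative update for nonnegative quadratic programming can look like, here is a minimal sketch. The abstract does not state the update rule; the rule below is the form commonly given in the nonnegative quadratic programming literature and is an assumption here, as are the function name and the toy problem.

```python
import math


def nqp_multiplicative_updates(A, b, v, iterations=200):
    """Minimize 0.5 * v^T A v + b^T v subject to v >= 0 (sketch).

    A is split into its positive part A_pos and the magnitude of its
    negative part A_neg, so that A = A_pos - A_neg. Each step rescales
    v[i] by a nonnegative factor, so v never leaves the feasible region.
    """
    n = len(v)
    A_pos = [[max(a, 0.0) for a in row] for row in A]
    A_neg = [[max(-a, 0.0) for a in row] for row in A]
    v = list(v)
    for _ in range(iterations):
        pos = [sum(A_pos[i][j] * v[j] for j in range(n)) for i in range(n)]
        neg = [sum(A_neg[i][j] * v[j] for j in range(n)) for i in range(n)]
        updated = []
        for i in range(n):
            if pos[i] == 0.0:
                # Factor undefined; leave this coordinate unchanged.
                updated.append(v[i])
                continue
            root = math.sqrt(b[i] ** 2 + 4.0 * pos[i] * neg[i])
            updated.append(v[i] * (-b[i] + root) / (2.0 * pos[i]))
        v = updated
    return v


# Toy problem (hypothetical): the unconstrained minimum A v = -b is v = [1, 1],
# which is already nonnegative, so the constrained solution is the same.
A = [[2.0, -1.0], [-1.0, 2.0]]
b = [-1.0, -1.0]
solution = nqp_multiplicative_updates(A, b, [0.5, 0.5])
print(solution)
```

The update needs no step size, and a strictly positive starting point stays nonnegative throughout, which is the property the abstract attributes to multiplicative updates.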

OTuCc - Speech Modeling and Features III


In speech recognition systems, feature extraction can be achieved in two steps: parameter extraction and feature transformation.  Feature transformation is an important step. It can concentrate the energy distributions of a speech signal onto fewer dimensions than those of parameter extraction and thus reduce the dimensionality of the system. Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) are the two popular feature transformation methods. This paper investigates their performances in dimensionality reduction tasks in continuous speech recognition systems. A new type of feature transformation, LP transformation, is proposed and its performance is compared to those of LDA and PCA transformations.


PTuCf: Speech Recognition, Search and Language Modelling (selected by AW)


S. Kanthak, RWTH Aachen, Germany

Hermann Ney, RWTH Aachen, Germany

Page: 1145-1148, Paper Number: 544

Abstract: In this paper we combine grapheme-based sub-word units with multilingual acoustic modeling. We show that a global decision tree together with automatically generated grapheme questions eliminates manual effort completely. We also investigate the effects of additional language questions. We present experimental results on four corpora with different languages, namely the Dutch and French ARISE corpus, the Italian EUTRANS corpus and the German VERBMOBIL corpus. Graphemes are shown to give good coverage on all four languages and represent a large set of shared sub-word models. For all experiments, the acoustic models are trained from scratch in order not to use any prior phonetic knowledge. Finally, we show that for the Dutch and German tasks, the presented approach works well and may also help to decrease the word error rate below that obtained by monolingual acoustic models. For all four languages, adding language questions to the multilingual decision tree helps to improve the word error rate.

OTuDa - Robust Speech Recognition - Front-end Processing


Speech recognition accuracy degrades significantly when the speech has been corrupted by noise, especially when the system has been trained on clean speech. Many compensation algorithms have been developed which require reliable online noise estimates or  a priori knowledge of the noise. In situations where such estimates or knowledge is difficult to obtain, these methods fail. We present a new robustness algorithm which avoids these problems by making no assumptions about the corrupting noise.  Instead, we exploit properties inherent to the speech signal itself to denoise the recognition features. In this method, speech is decomposed into harmonic and noise-like components, which are then processed independently and recombined. By processing noise-corrupted speech in this manner we achieve significant improvements in recognition accuracy on the Aurora 2 task.

Selection AW:
OWeBa-Oral - Speech Recognition - Adaptation II


Driss Matrouf, LIA-CNRS, France

Olivier Bellot, LIA-CNRS, France

Pascal Nocera, LIA-CNRS, France

Georges Linares, LIA-CNRS, France

Jean-Francois Bonastre, LIA-CNRS, France

Page: 1625-1628, Paper Number: 937

Abstract: Within the framework of speaker adaptation, a technique based on a tree structure and the maximum a posteriori criterion was proposed (SMAP). In SMAP, the parameter estimation at each node in the tree is based on the assumption that the mismatch between the training and adaptation data is a Gaussian PDF whose parameters are estimated using the Maximum Likelihood criterion. To avoid poor estimation accuracy of the transformation parameters due to insufficient adaptation data in a node, we propose a new technique based on the maximum a posteriori approach and PDF Gaussians Merging. The basic idea behind this new technique is to estimate affine transformations which bring the training acoustic models as close as possible to the test acoustic models, rather than transformations maximizing the likelihood of the adaptation data. In this manner, even with a very small amount of adaptation data, the transformation parameters are accurately estimated for means and variances. This adaptation strategy has shown a significant performance improvement in a large vocabulary speech recognition task, alone and combined with the MLLR adaptation.

PWeBa - Speech Signal Processing II


Yoshinori Shiga, University of Edinburgh, U.K.

Simon King, University of Edinburgh, U.K.

Page: 1749-1752, Paper Number: 512

Abstract: This paper presents a new approach for estimating voice source and vocal tract filter characteristics of voiced speech. When it is required to know the transfer function of a system in signal processing, the input and output of the system are experimentally observed and used to calculate the function. However, in the case of source-filter separation we deal with in this paper, only the output (speech) is observed and the characteristics of the system (vocal tract) and the input (voice source) must simultaneously be estimated. Hence the estimate becomes extremely difficult, and it is usually solved approximately using oversimplified models. We demonstrate that these characteristics are separable under the assumption that they are independently controlled by different factors. The separation is realised using an iterative approximation along with the Multi-frame Analysis method, which we have proposed to find spectral envelopes of voiced speech with minimum interference of the harmonic structure.


PWeBf - Robust Speech Recognition II


Rita Singh, Carnegie Mellon University, USA

Manfred K. Warmuth, University of California at Santa Cruz, USA

Bhiksha Raj, Mitsubishi Electric Research Laboratories, USA

Paul Lamere, Sun Microsystems Laboratories, USA

Page: 1773-1776, Paper Number: 82

Abstract: In this paper we describe a generalized classification method for HMM-based speech recognition systems, that uses free energy as a discriminant function rather than conventional probabilities. The discriminant function incorporates a single adjustable temperature parameter T. The computation of free energy can be motivated using an entropy regularization, where the entropy grows monotonically with the temperature. In the resulting generalized classification scheme, the values of T = 0 and T = 1 give the conventional Viterbi and forward algorithms, respectively, as special cases. We show experimentally that if the test data are mismatched with the classifier, classification at temperatures higher than one can lead to significant improvements in recognition performance. The temperature parameter is far more effective in improving performance on mismatched data than a variance scaling factor, which is another apparent single adjustable parameter that has a very similar analytical form.
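The role of the temperature can be sketched with a generalized forward pass in which the usual log-sum over predecessor states is replaced by T * log(sum(exp(x / T))). This is an illustrative sketch only: it assumes path energies are negative log joint probabilities, the function names and the toy two-state HMM are hypothetical, and the paper's exact formulation may differ.

```python
import math


def tempered_logsum(values, T):
    """T * log(sum(exp(v / T))); equals max(values) in the limit T -> 0."""
    if T == 0:
        return max(values)
    m = max(values)  # subtract the maximum for numerical stability
    return m + T * math.log(sum(math.exp((v - m) / T) for v in values))


def free_energy_score(log_init, log_trans, log_emit, obs, T):
    """Generalized forward pass over a discrete HMM (negative free energy).

    T = 1 yields the forward log-likelihood, T = 0 the Viterbi log score.
    """
    n = len(log_init)
    alpha = [log_init[s] + log_emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [
            tempered_logsum([alpha[r] + log_trans[r][s] for r in range(n)], T)
            + log_emit[s][o]
            for s in range(n)
        ]
    return tempered_logsum(alpha, T)


# Toy two-state, two-symbol HMM (hypothetical numbers).
log_init = [math.log(p) for p in (0.6, 0.4)]
log_trans = [[math.log(p) for p in row] for row in ((0.7, 0.3), (0.4, 0.6))]
log_emit = [[math.log(p) for p in row] for row in ((0.9, 0.1), (0.2, 0.8))]
obs = [0, 1, 1]

viterbi_score = free_energy_score(log_init, log_trans, log_emit, obs, T=0)
forward_score = free_energy_score(log_init, log_trans, log_emit, obs, T=1)
hot_score = free_energy_score(log_init, log_trans, log_emit, obs, T=2)
print(viterbi_score, forward_score, hot_score)
```

In this sketch the score is non-decreasing in T, since raising the temperature spreads weight over more state sequences, which is in line with the abstract's remark that the entropy grows with the temperature.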

OWeCa - Speech Recognition - Large Vocabulary II


Johan Schalkwyk, SpeechWorks International, USA

Lee Hetherington, Massachusetts Institute of Technology, USA

Ezra Story, SpeechWorks International, USA

Page: 1969-1972, Paper Number: 632

Abstract: Spoken language systems, ranging from interactive voice response (IVR) to mixed-initiative conversational systems, make use of a wide range of recognition grammars and vocabularies. The recognition grammars are either static (created at design time) or dynamic (dependent on database lookup at run time). This paper examines the compilation of recognition grammars with an emphasis on the dynamic (changing) properties of the grammar and how these relate to context-dependent speech recognizers. By casting the problem in the algebra of finite-state transducers (FSTs) we can use the composition operator for fast and efficient compilation and splicing of dynamic recognition grammars within the context of a larger precompiled static grammar.



D. Povey, Cambridge University, U.K.

M.J.F. Gales, Cambridge University, U.K.

D.Y. Kim, Cambridge University, U.K.

P.C. Woodland, Cambridge University, U.K.

Page: 1981-1984, Paper Number: 1024

Abstract: This paper investigates the use of discriminative schemes based on the maximum mutual information (MMI) and minimum phone error (MPE) objective functions for both task and gender adaptation.  A method for incorporating prior information into the discriminative training framework is described. If an appropriate form of prior distribution is used, then this may be implemented by simply altering the values of the counts used for parameter estimation. The prior distribution can be based around maximum likelihood parameter estimates, giving a technique known as I-smoothing, or for adaptation it can be based around a MAP estimate of the ML parameters, leading to MMI-MAP, or MPE-MAP. MMI-MAP is shown to be effective for task adaptation, where data from one task (Voicemail) is used to adapt a HMM set trained on another task (Switchboard). MPE-MAP is shown to be effective for generating gender-dependent models for Broadcast News transcription.

PWeDg - Acoustic Modelling I


Karen Livescu, Massachusetts Institute of Technology, USA

James Glass, Massachusetts Institute of Technology, USA

Jeff Bilmes, University of Washington, USA

Page: 2529-2532, Paper Number: 1082

Abstract: In this paper, we investigate the use of dynamic Bayesian networks (DBNs) to explicitly represent models of hidden features, such as articulatory or other phonological features, for automatic speech recognition. In previous work using the idea of hidden features, the representation has typically been implicit, relying on a single hidden state to represent a combination of features. We present a class of DBN-based hidden feature models, and show that such a representation can be not only more expressive but also more parsimonious. We also describe a way of representing the acoustic observation model with fewer distributions using a product of models, each corresponding to a subset of the features. Finally, we describe our recent experiments using hidden feature models on the Aurora 2.0 corpus.

OThBd - Acoustic Modelling II


Karthik Visweswariah, IBM T.J. Watson Research Center, USA

Scott Axelrod, IBM T.J. Watson Research Center, USA

Ramesh Gopinath, IBM T.J. Watson Research Center, USA

Page: 2613-2616, Paper Number: 673

Abstract: Gaussian distributions are usually parameterized with their natural parameters: the mean (mu) and the covariance (Sigma). They can also be re-parameterized as exponential models with canonical parameters P = (Sigma) ^-1 and (psi) = P(mu). In this paper we consider modeling acoustics with mixtures of Gaussians parameterized with canonical parameters where the parameters are constrained to lie in a shared affine subspace. This class of models includes Gaussian models with various constraints on its parameters: diagonal covariances, MLLT models, and the recently proposed EMLLT and SPAM models. We describe how to perform maximum likelihood estimation of the subspace and parameters within a fixed subspace. In speech recognition experiments, we show that this model improves upon all of the above classes of models with roughly the same number of parameters and with little computational overhead. In particular we get 30-40% relative improvement over LDA+MLLT models when using roughly the same number of parameters.
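The re-parameterization mentioned at the start of this abstract can be checked numerically. The sketch below, restricted to two dimensions so that it needs no external libraries, converts (mu, Sigma) to P = Sigma^-1 and psi = P mu and evaluates the same log-density both ways. All function names and the example numbers are hypothetical.

```python
import math


def inv2(m):
    """Inverse and determinant of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]], det


def matvec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def log_gauss_mean_cov(x, mu, sigma):
    """2-D Gaussian log-density from mean and covariance."""
    P, det_sigma = inv2(sigma)
    diff = [x[0] - mu[0], x[1] - mu[1]]
    return -0.5 * dot(diff, matvec(P, diff)) - 0.5 * math.log(det_sigma) - math.log(2 * math.pi)


def to_canonical(mu, sigma):
    """Return (P, psi) with P = Sigma^-1 and psi = P mu."""
    P, _ = inv2(sigma)
    return P, matvec(P, mu)


def log_gauss_canonical(x, P, psi):
    """Same log-density written in terms of P and psi."""
    sigma, det_P = inv2(P)
    mu = matvec(sigma, psi)  # used only for the normalizer
    return (-0.5 * dot(x, matvec(P, x)) + dot(psi, x)
            - 0.5 * dot(psi, mu) + 0.5 * math.log(det_P) - math.log(2 * math.pi))


mu = [1.0, -2.0]
sigma = [[2.0, 0.5], [0.5, 1.0]]
x = [0.3, -1.1]
P, psi = to_canonical(mu, sigma)
lp_mean_cov = log_gauss_mean_cov(x, mu, sigma)
lp_canonical = log_gauss_canonical(x, P, psi)
print(lp_mean_cov, lp_canonical)
```

Apart from the normalizer, the canonical form is linear in P and psi, which is one way to see why constraining these parameters to an affine subspace is a natural choice.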



On the Use of Kernel PCA for Feature Extraction in Speech Recognition

Amaro Lima, Nagoya Institute of Technology, Japan

Heiga Zen, Nagoya Institute of Technology, Japan

Yoshihiko Nankaku, Nagoya Institute of Technology, Japan

Chiyomi Miyajima, Nagoya Institute of Technology, Japan

Keiichi Tokuda, Nagoya Institute of Technology, Japan

Tadashi Kitamura, Nagoya Institute of Technology, Japan

Page: 2625-2628, Paper Number: 565

Abstract: This paper describes an approach for feature extraction in speech recognition systems using kernel principal component analysis (KPCA). This approach consists in representing speech features as the projection of the extracted speech features mapped into a feature space via a nonlinear mapping onto the principal components. The nonlinear mapping is implicitly performed using the kernel trick, which is a useful way of not mapping the input space into a feature space explicitly, making this mapping computationally feasible. Better results were obtained by using this approach when compared to the standard technique.
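The kernel trick described here can be sketched in a few lines: the principal component is found from the centered Gram matrix, never from explicit feature-space vectors. This is a minimal illustration only. The RBF kernel, the power iteration used for the leading eigenvector, the function names, and the toy data are all assumptions; the abstract does not say which kernel or solver the authors used.

```python
import math


def rbf(x, y, gamma=1.0):
    """Assumed kernel: exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def centered_gram(points, gamma=1.0):
    """Gram matrix of the points after centering in feature space."""
    n = len(points)
    K = [[rbf(p, q, gamma) for q in points] for p in points]
    row = [sum(r) / n for r in K]  # K is symmetric: row means = column means
    total = sum(row) / n
    return [[K[i][j] - row[i] - row[j] + total for j in range(n)] for i in range(n)]


def top_component(Kc, iterations=500):
    """Leading eigenpair of Kc by power iteration, scaled for unit feature norm."""
    n = len(Kc)
    alpha = [float(i + 1) for i in range(n)]  # arbitrary non-constant start
    for _ in range(iterations):
        nxt = [sum(Kc[i][j] * alpha[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(a * a for a in nxt))
        alpha = [a / norm for a in nxt]
    lam = sum(alpha[i] * sum(Kc[i][j] * alpha[j] for j in range(n)) for i in range(n))
    scale = 1.0 / math.sqrt(lam)  # makes lam * (alpha . alpha) equal to 1
    return lam, [a * scale for a in alpha]


# Toy data (hypothetical): two well separated clusters on a line.
points = [[0.0], [0.1], [0.2], [2.0], [2.1], [2.2]]
Kc = centered_gram(points)
lam, alpha = top_component(Kc)
# Projection of each training point onto the first kernel principal component.
projections = [sum(Kc[i][j] * alpha[j] for j in range(len(points))) for i in range(len(points))]
print(projections)
```

On this toy data the first component separates the two clusters by sign; in a recognizer the projections onto several such components would replace or extend the original feature vector.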

PThCf - Robust Speech Recognition IV

Mai, SB

Michael J. Carey, University of Bristol, U.K.

Page: 3045-3048, Paper Number: 124

Abstract: A new simple but robust method of front-end analysis, nonlinear spectral smoothing (NLSS), is proposed. NLSS uses rank-order filtering to replace noisy low-level speech spectrum coefficients with values computed from adjacent spectral peaks. The resulting transformation bears significant similarities with masking in the auditory system. It can be used as an intermediate processing stage between the FFT and the filter-bank analyzer. It also produces features which can be cosine transformed and used by a pattern matcher. NLSS gives significant improvements in the performance of speech recognition systems in the presence of stationary noise, a reduction in error rate of typically 50% or an increased tolerance to noise of 3dB for the same error rate in an isolated digit test on the Noisex database. Results on female speech were superior to those on male speech: female speech gave a recognition error rate of 1.1% at a 0dB signal to noise ratio.



Yan Ming Cheng, Motorola Labs, USA

Chen Liu, Motorola Labs, USA

Yuan-Jun Wei, Motorola Labs, USA

Lynette Melnar, Motorola Labs, USA

Changxue Ma, Motorola Labs, USA

Page: 3121-3124, Paper Number: 614

Abstract: There is an increasing need to deploy speech recognition systems supporting multiple languages/dialects on portable devices worldwide. A common approach uses a collection of individual monolingual speech recognition systems as a solution. However, such an approach is not practical for handheld devices such as cell phones due to stringent restrictions on memory and computational resources. In this paper, we present a simple and effective method to develop multilingual acoustic models that achieve comparable performance relative to monolingual acoustic models but with only a fraction of the storage space of the combined monolingual acoustic model set.



Mirjam Killer, ETH Zurich, Switzerland

Sebastian Stuker, Universitat Karlsruhe, Germany

Tanja Schultz, Carnegie Mellon University, USA

Page: 3141-3144, Paper Number: 1061

Abstract: Large vocabulary speech recognition systems traditionally represent words in terms of subword units, usually phonemes. This paper investigates the potential of graphemes acting as subunits. In order to develop context dependent grapheme based speech recognizers several decision tree based clustering procedures are performed and compared to each other. Grapheme based speech recognizers in three languages -- English, German, and Spanish -- are trained and compared to their phoneme based counterparts. The results show that for languages with a close grapheme-to-phoneme relation, grapheme based modeling is as good as the phoneme based one. Furthermore, multilingual grapheme based recognizers are designed to investigate whether grapheme based information can be successfully shared among languages. Finally, some bootstrapping experiments for Swedish were performed to test the potential for rapid language deployment.

Last modified: 19.10.2011 - Contact: Dipl.-Ing. Arno Krüger