Otto-von-Guericke-Universität Magdeburg


Eurospeech 2001 Aalborg

A.  Wendemuth

Eurospeech - Highlights 2001

Magdeburg 15.01.01


Session A34 Speech Recognition with fragments


Bazzi I, Glass J

MIT Laboratory for Computer Science, USA

This paper describes our recent work on detecting and recognizing out-of-vocabulary (OOV) words for robust speech recognition and understanding. To allow for OOV recognition within a word-based recognizer, the in-vocabulary (IV) word network is augmented with an OOV model so that OOV words are considered simultaneously with IV words during recognition. We explore several configurations for the OOV model, the best of which utilizes a set of domain-independent, automatically derived, variable-length units. The units are created using an iterative bottom-up procedure where, at each iteration, the unit pairs with maximum mutual information are merged. When evaluating this method on a weather information domain, the false alarm rate of our baseline OOV model is reduced by over 60%. For example, with an OOV detection rate of 70%, the OOV false alarm rate is reduced from 8.5% to 3.2%, with only 3% relative degradation in word error rate on IV data.

- train oov from large dictionary

- multi-phone units by MMI grouping
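
The bottom-up merging described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name merge_step, the "_" joiner, and the use of weighted pointwise mutual information as the merge criterion are all assumptions made here.

```python
import math
from collections import Counter

def merge_step(sequences):
    """One bottom-up iteration: merge the adjacent unit pair with the
    highest weighted mutual information (criterion is an assumption)."""
    unigrams = Counter(u for seq in sequences for u in seq)
    bigrams = Counter(p for seq in sequences for p in zip(seq, seq[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    if n_bi == 0:
        return sequences, None

    def weighted_mi(pair):
        # p(x,y) * log( p(x,y) / (p(x) p(y)) )
        p_xy = bigrams[pair] / n_bi
        p_x = unigrams[pair[0]] / n_uni
        p_y = unigrams[pair[1]] / n_uni
        return p_xy * math.log(p_xy / (p_x * p_y))

    best = max(bigrams, key=weighted_mi)

    # Rewrite every sequence, replacing occurrences of the best pair
    # with a single merged unit.
    merged = []
    for seq in sequences:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                out.append(seq[i] + "_" + seq[i + 1])
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged, best
```

Calling merge_step repeatedly on phone transcriptions of a large dictionary would grow longer multi-phone units one merge at a time. When to stop merging is left open here.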


Kneissler J, Klakow D

Philips Forschungslaboratorien, Germany

This paper describes approaches for decomposing words of huge vocabularies (up to 2 million) into smaller particles that are suitable for a recognition lexicon. Results on a Finnish dictation task and a flat list of German street names are given. Volume 1, page 69 Session A34

- construction of fragments for LM

- proper cutting of words into fragments

- glueing fragments after recognition ; word boundaries

- using morphological constraints

- select optimal fragment subset by discrete gradient method
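
The "glueing" step in the notes above can be illustrated with a small sketch. It assumes a hypothetical convention in which a trailing "+" marks a fragment that continues into the next one. The paper's actual boundary marking is not given in the abstract.

```python
def glue_fragments(fragments, joiner="+"):
    """Rebuild words from recognized fragments.

    A fragment ending in `joiner` (a hypothetical marker) is glued to the
    fragment that follows it; any other fragment closes the current word.
    """
    words, current = [], ""
    for frag in fragments:
        if frag.endswith(joiner):
            current += frag[:-len(joiner)]
        else:
            words.append(current + frag)
            current = ""
    if current:
        # The recognizer output ended on a dangling continuation fragment.
        words.append(current)
    return words
```

With this convention, word boundaries are recoverable from the fragment sequence alone, which is one way to address the "word boundaries" point in the notes.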


Session A41 - Oral & Poster Monday - 15.50 - 18.00

Noise Robust Recognition: Front-end and Compensation Algorithms on the Aurora Database


Zhu Q, Iseli M, Cui X, Alwan A

University of California, Los Angeles, USA

Four front-end processing techniques developed for noise robust speech recognition are tested with the Aurora 2 database. These techniques include three previously published algorithms: variable frame rate analysis [Zhu and Alwan, 2000], peak isolation [Strope and Alwan, 1997], and harmonic demodulation [Zhu and Alwan, 2000], and a new technique for peak-to-valley ratio locking. Our previous work has focused on isolated digit recognition. In this paper, these algorithms are modified for recognition of connected digits. Recognition results with the Aurora 2 database show that a combination of these four techniques yields a 40% error rate reduction compared to the baseline MFCC front-end for the clean training condition, with no significant increase in computational complexity. Volume 1, page 185 Session A41


Ellis D P W, Reyes Gomez M J

Columbia University, USA

In tandem acoustic modeling, signal features are first processed by a discriminantly-trained neural network, then the outputs of this network are treated as the feature inputs to a conventional distribution-modeling Gaussian-mixture model speech recognizer. This arrangement achieves relative error rate reductions of 30% or more on the Aurora task, as well as supporting feature stream combination at the posterior level, which can eliminate more than 50% of the errors compared to the HTK baseline. In this paper, we explore a number of variations on the tandem structure: We experiment with changing the subword units used in each model (neural net and GMM), varying the data subsets used to train each model, substituting the posterior calculations in the neural net with a second GMM, and a variety of feature conditioning such as deltas, normalization and PCA rank reduction in the 'tandem domain', i.e. between the two models. Volume 1, page 189 Session A41
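
The conditioning step between the two models can be sketched as below. This assumes the network posteriors are already available as a frames-by-classes array. The log compression and the function name tandem_features are assumptions of this sketch; only the PCA rank reduction is named in the abstract.

```python
import numpy as np

def tandem_features(posteriors, n_components, eps=1e-10):
    """Turn per-frame network posteriors into GMM input features.

    posteriors: array of shape (frames, classes), rows summing to 1.
    Returns an array of shape (frames, n_components).
    """
    # Log compression makes the skewed posterior distribution more
    # Gaussian-like (an assumption of this sketch).
    log_post = np.log(posteriors + eps)
    centered = log_post - log_post.mean(axis=0)
    # PCA: project onto the leading eigenvectors of the covariance matrix.
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    return centered @ eigvecs[:, order]
```

The output columns are decorrelated, which suits diagonal-covariance Gaussian mixtures. In practice the projection would be estimated on training data and reused at test time.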


Yao K, Chen J, Paliwal K K, Nakamura S

ATR Spoken Language Translation Research Labs., Japan

We have evaluated several feature-based methods and a model-based method for robust speech recognition in noise. The evaluation was performed on the Aurora 2 task. We show that after a sub-band based spectral subtraction, features can be more robust to additive noise. We also report a robust feature set derived from the differential power spectrum (DPS), which is robust not only to additive noise but also to spectrum colorization due to channel effects. When the clean training set is available, we show that a model-based noise compensation method can effectively improve system robustness to noise. Over the testing sets as a whole, the feature-based methods yield about 22% relative improvement in accuracy for the multi-condition training task, and the model-based method yields about 63% relative performance improvement when systems are trained on the clean training set. Volume 1, page 233 Session A41
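
Two of the feature-domain ideas mentioned above can be sketched in a few lines. Both functions are simplified illustrations under assumptions made here: the spectral subtraction is full-band with a fixed floor, not the sub-band variant of the paper, and the DPS is taken as a plain first-order difference between neighbouring frequency bins.

```python
import numpy as np

def spectral_subtraction(power_spec, noise_est, floor=0.01):
    """Subtract a noise power estimate, flooring the result so that no
    bin drops below `floor` times its original power."""
    cleaned = power_spec - noise_est
    return np.maximum(cleaned, floor * power_spec)

def differential_power_spectrum(power_spec):
    """First-order difference along the frequency axis:
    D(k) = P(k) - P(k+1)."""
    return power_spec[..., :-1] - power_spec[..., 1:]
```

A spectrally flat additive noise floor cancels in the difference, which gives some intuition for why DPS-derived features can tolerate additive noise.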


Session B14 - Oral Tuesday - 09.00 - 10.40

Speech Recognition and Understanding:

LVCSR - I (Large Vocabulary Continuous Speech Recognition)


Beyerlein P, Aubert X, Harris M, Meyer C, Schramm H

Philips Research Laboratories Aachen, Germany

Automatic speech recognition of real-life conversational speech is a precondition for building natural human-centered man-machine interfaces. Although we are able to extract speech utterances from real-life broadcast news audio streams and transcribe them with an overall word accuracy of 83%, we are still faced with the problem of transcribing true conversational speech in real-life (i.e. bad) background conditions. The Switchboard task focusses on the latter problem. The paper summarizes a set of experimental investigations on the Switchboard corpus using the Philips LVCSR system. Volume 1, page 499 Session B14


Session B21 - Oral Tuesday - 11.10 - 12.50

Noise Robust Recognition: Robust systems - What helps ?


Lieb M, Fischer A

Philips Research Labs Aachen, Germany

With this work we evaluate the Philips continuous speech recognition system on the standardized AURORA noisy digit string recognition task. A variety of noise robust algorithms, ranging from spectral subtraction during the feature extraction stage, to adaptation techniques in the HMM-decoding stage, are applied and their effects are presented. Detailed experimental results show the contribution of the single approaches to the overall system performance. By thoroughly combining the best performing of the standard algorithms, we achieve significant improvements for the matched training as well as for the non-matched condition scenarios. Volume 1, page 625 Session B21


Saon G, Huerta J, Jan E-E

IBM T.J. Watson Research Center, USA

In this paper we describe some experiments on the Aurora 2 noisy digits database. The algorithms that we used can be broadly classified into noise robustness techniques based on a linear-channel model of the acoustic environment such as CDCN and its novel variant termed Alignment-based CDCN (ACDCN, proposed here), and techniques which do not assume any particular knowledge about the structure of the environment or noise conditions affecting the speech signal such as discriminant feature space transformations and speaker/channel adaptation. We present recognition experiments for both the clean training data and the multi-condition training data scenarios. Volume 1, page 629 Session B21


Afify M, Jiang H, Korkmazskiy F, Lee C-H, Li Q, Siohan O, Soong F K, Surendran A C

Bell Labs, Lucent Technologies, USA

Connected digit recognition has always been an ideal task for fundamental research in speech recognition due to its low complexity and potential applications. At Bell Labs we have developed a number of techniques aimed directly or indirectly at connected digit recognition. For the Aurora task, we study a few such algorithms covering the entire spectrum of issues, including feature extraction, context-dependent digit modeling, minimum classification error acoustic modeling, unsupervised noise compensation, and utterance verification. We show how each component contributes to the reduction of digit recognition and verification errors. Averaged over all three test sets, we obtained 84.6% and 91.3% digit accuracies for clean- and multi-condition training, respectively. This represents an average of 48.6% error rate reduction when compared to the official Aurora baseline results. Volume 1, page 633 Session B21
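
Since several of the Aurora abstracts quote relative error rate reductions, the arithmetic behind such figures is spelled out here. The baseline accuracy in the usage note is purely hypothetical and is not the official Aurora baseline.

```python
def relative_error_reduction(acc_baseline, acc_new):
    """Relative error rate reduction, given accuracies in percent.

    Error rate is taken as 100 - accuracy; the result is the fraction of
    baseline errors that the new system removes.
    """
    err_baseline = 100.0 - acc_baseline
    err_new = 100.0 - acc_new
    return (err_baseline - err_new) / err_baseline
```

For example, going from a hypothetical 80% to 90% accuracy halves the error rate: relative_error_reduction(80.0, 90.0) gives 0.5. An average over test conditions, as quoted in the abstract, is generally not the same as the reduction computed from averaged accuracies.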


Session B35 - Poster (I), B46 (II) Tuesday - 14.00 - 15.20

Speech Recognition and Understanding: Noise Robustness


Trentin E, Gori M

Universita' di Siena, Italy

Acoustic models relying on hidden Markov models (HMMs) are heavily noise-sensitive: recognition performance drops whenever a significant difference in acoustic conditions holds between training and test environments. The relevance of developing acoustic models that are intrinsically robust has to be stressed. Robustness to noise is related to the generalization capabilities of the model. Artificial neural networks (ANNs) appear to be a promising alternative, but they historically failed as a general paradigm for speech recognition. This paper addresses the problem by (i) investigating the recognition performance of the ANN/HMM hybrid proposed by the authors over tasks with noisy signals, and (ii) proposing an explicit "soft" weight grouping technique, capable of improving its robustness. Experiments over noisy speaker-independent connected-digit strings are presented. In particular, results on the VODIS II/SpeechDatCar database, collected in a real car environment, show the dramatic gain over the standard HMM, as well as over Bourlard and Morgan's hybrid. Volume 2, page 889 Session B35


- in here : why Neural Networks fail !!!


Renevey P, Vetter R, Krauss J

CSEM, Switzerland

This paper addresses the problem of speech recognition in noisy conditions when low complexity is required, as in embedded systems. In such systems, vector quantization is generally used to reduce the complexity of the recognition systems (e.g. HMMs). A novel approach for vector quantization based on missing data theory is proposed. This approach increases the robustness of the system against noise perturbations with only a small increase in computational requirements. The proposed algorithm is composed of two parts. The first part consists in dividing the spectro-temporal features of the noisy signal into two subspaces: the unreliable (or missing) features and the reliable (or present) features. The second part consists in defining a robust distance measure for vector quantization that compensates for the unreliable features. The proposed approach obtains results in noisy conditions similar to those of a more classical approach that adapts the codebook of the vector quantization to the noisy conditions using model compensation. However, the computational requirements of the proposed approach are lower, and it is more suitable for a low-complexity speech recognition system. Volume 2, page 1107 Session B46

-  state of the art references given in here
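
The two-part idea (a reliability mask, then a distance restricted by that mask) can be sketched as below. This uses plain marginalization, i.e. unreliable components are ignored. That is a simplifying assumption: the abstract says only that the paper's distance measure "compensates" for the unreliable features. The function name masked_vq is hypothetical.

```python
import numpy as np

def masked_vq(frame, reliable, codebook):
    """Quantize a feature vector using only its reliable components.

    frame:    array of shape (dims,)
    reliable: boolean array of shape (dims,), True where the feature is
              judged to be dominated by speech rather than noise
    codebook: array of shape (entries, dims)
    Returns the index of the nearest codebook entry.
    """
    diff = codebook[:, reliable] - frame[reliable]
    dists = np.sum(diff ** 2, axis=1)
    return int(np.argmin(dists))
```

The codebook itself stays trained on clean speech; only the distance computation changes, which fits the low-complexity goal.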


Session C14 - Oral Wednesday - 09.00 - 10.40

Speech Recognition and Understanding: Speaker Adaptation


Gunawardana A, Byrne W

The Johns Hopkins University, USA.

We present a simplified derivation of the extended Baum-Welch procedure, which shows that it can be used for Maximum Mutual Information (MMI) estimation of a large class of continuous emission density hidden Markov models (HMMs). We use the extended Baum-Welch procedure for discriminative estimation of MLLR-type speaker adaptation transformations. The resulting adaptation procedure, termed Conditional Maximum Likelihood Linear Regression (CMLLR), is used successfully for supervised and unsupervised adaptation tasks on the Switchboard corpus, yielding an improvement over MLLR. The interaction of unsupervised CMLLR with segmental minimum Bayes risk lattice voting procedures is also explored, showing that the two procedures are complementary. Volume 2, page 1203 Session C14
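
For orientation, the difference between the two adaptation criteria can be written out as follows. The notation is chosen here and is not taken from the paper: T = (A, b) is the affine transform applied to the Gaussian means, O the adaptation observations, and W their (supervised or hypothesized) transcription.

```latex
% MLLR-type mean transformation
\hat{\mu}_m = A\,\mu_m + b

% MLLR: maximum likelihood estimate of the transform
T_{\mathrm{ML}} = \arg\max_{T}\; p(O \mid W;\, T)

% Conditional MLLR: maximize the posterior of the transcription instead
T_{\mathrm{CML}} = \arg\max_{T}\; P(W \mid O;\, T)
               = \arg\max_{T}\; \frac{p(O \mid W;\, T)\, P(W)}
                                     {\sum_{W'} p(O \mid W';\, T)\, P(W')}
```

The denominator is what makes the second criterion discriminative: the transform is rewarded for separating the correct transcription from its competitors, not just for fitting the data.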


Session D15 - Poster Thursday - 09.00 - 10.40

Speech Recognition and Understanding: Acoustic Modelling - II


Duchateau J, Demuynck K, Van Compernolle D, Wambacq P

Katholieke Universiteit Leuven - ESAT, Belgium

The aim of discriminant feature analysis techniques in the signal processing of speech recognition systems is to find a feature vector transformation which maps a high dimensional input vector onto a low dimensional vector while retaining a maximum amount of information in the feature vector to discriminate between predefined classes. This paper points out the significance of the definition of the classes in the discriminant feature analysis technique. Three choices for the definition of the classes are investigated: the phonemes, the states in context independent acoustic models and the tied states in context dependent acoustic models. These choices for the classes were applied to (1) standard LDA (linear discriminant analysis) for reference and to (2) MIDA, an improved, mutual information based discriminant analysis technique. Evaluation of the resulting linear feature transforms on a large vocabulary continuous speech recognition task shows, depending on the technique, the best choice for the classes. Volume 3, page 1621 Session D15

- MIDA transformation !
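
As a reference point for the class-definition question, standard LDA can be sketched as follows. This shows only the LDA baseline, not MIDA. The class labels are whatever definition is chosen (phonemes, states, or tied states), and the function name lda_transform is an assumption of this sketch.

```python
import numpy as np

def lda_transform(X, labels, out_dim):
    """Estimate a linear discriminant transform.

    X:      array of shape (frames, dims)
    labels: array of shape (frames,) with one class id per frame
    Returns a matrix of shape (out_dim, dims); project with X @ W.T.
    """
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))  # within-class scatter
    S_b = np.zeros((d, d))  # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        S_w += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_all)[:, None]
        S_b += len(Xc) * (diff @ diff.T)
    # Directions maximizing between-class over within-class scatter are
    # the leading eigenvectors of S_w^{-1} S_b.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(eigvals.real)[::-1][:out_dim]
    return eigvecs[:, order].real.T
```

Changing the class definition changes only the labels array, which is why the three choices in the abstract can be compared with the same machinery.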


Yu P, Wang Z

Tsinghua University, P. R. China

Acoustic model training is very important in speech recognition. But in traditional training algorithms, each state is treated separately, and the relationship between different states is not considered. In this paper we bring forward a novel idea of using the correlation information between states, which is called "spatial correlation". We describe this correlation information as linear constraints. According to phonetic knowledge, we first divide the states into small groups named "correlation sub-spaces". In every sub-space, we use eigenvalue decomposition to get linear constraints. The constraints are then used in a new training algorithm. Experiments with the new training algorithm show significant improvement over the traditional training algorithm. Volume 3, page 1629 Session D15

- sub-spaces and eigenvalue decompositions


Keshet J 1 , Chazan D 2 , Bobrovsky B-Z 1

1 Tel Aviv University, Israel, 2 IBM Israel - Science and Technology,Israel

This paper presents a novel algorithm for precise spotting of plosives. The algorithm is based on a pattern matching technique implemented with margin classifiers, such as support vector machines (SVM). A special hierarchical treatment to overcome the problem of fricative and false silence detection is presented. It uses the loss-based multi-class decisions. Furthermore, a method for smoothing the overall decisions by sequential linear programming is described. The proposed algorithm was tested on the TIMIT corpus, which produced a very high spotting accuracy. The algorithm presented here is applied to plosives detection, but can easily be adapted to any class of phonemes. Volume 3, page 1637 Session D15


Litichever Z 1 , Chazan D 2

1 Technion Technology Institute, Israel, 2 IBM Israel - Science and Technology, Israel

This paper addresses the problem of classification of speech transition sounds. A number of non-parametric classifiers are compared, and it is shown that some non-parametric classifiers have considerable advantages over traditional hidden Markov models. Among the non-parametric classifiers, support vector machines were found the most suitable and the easiest to tune. Some of the reasons for the superiority of non-parametric classifiers will be discussed. The algorithm was tested on the voiced stop consonant phones extracted from the TIMIT corpus and resulted in very low error rates.

- Last two papers (same group)  use support vector machines. MK read this…


Shimodaira H 1 , Noma K-I 1 , Nakai M 1 , Sagayama S 2

1 Japan Advanced Institute of Science and Technology, Japan, 2 University of Tokyo, Japan

A new class of Support Vector Machine (SVM) applicable to sequential-pattern recognition is developed by incorporating the idea of non-linear time alignment into the kernel. Since the time-alignment operation on sequential patterns is embedded in the kernel evaluation, the same training and classification algorithms as for the original SVM can be employed without modification. Furthermore, frame-wise evaluation of the kernel in the proposed SVM (DTAK-SVM) enables frame-synchronous recognition of sequential patterns, which is suitable for continuous speech recognition. Preliminary experiments on speaker-dependent recognition of 6 voiced consonants demonstrated excellent performance, with a correct classification rate of more than 98%, compared with 93% for hidden Markov models (HMMs). Volume 3, page 1841 Session D25

-  and again.


Session E11 - Oral Friday - 09.00 - 10.40

Integration of Phonetic Knowledge in Speech Technology: Experiments and Experiences


Pastor-i-Gadea M, Casacuberta F

Universitat Politècnica de València, Spain

The great variability of word pronunciations in spontaneous speech is one of the reasons for the low performance of present speech recognition systems. The generation of dictionaries that take this variability into account can increase the robustness of such systems. A word pronunciation is a possible phone sequence that can appear in a real utterance, and represents a possible acoustic realization of the word. Here, word pronunciations are modeled using finite state automata. The use of such models allows for the application of grammatical inference methods and an easy integration with the other sources of knowledge. The training samples are obtained from the alignment between the phone decoding of each training utterance and the corresponding canonical transcription. The models proposed in this work were applied in a translation-oriented speech task. The improvements achieved by these models ranged from 0.6 to 2.7 points, depending on the language model used. Volume 4, page 2297 Session E11

- nice stuff: P(W|A) ∝ P(W) P(S|W) P(A|S), with S = pronunciation

- S modelled by finite state automaton: another HMM level
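
Written out in full, the factorization in the notes above reads as follows. This is a sketch of the usual derivation, under two assumptions made here: the acoustics A depend on the word sequence W only through the pronunciation S, and the sum over pronunciations is approximated by its largest term.

```latex
\hat{W} = \arg\max_{W} P(W \mid A)
        = \arg\max_{W} P(W)\, p(A \mid W)
        = \arg\max_{W} P(W) \sum_{S} P(S \mid W)\, p(A \mid S)
        \approx \arg\max_{W} \max_{S}\; P(W)\, P(S \mid W)\, p(A \mid S)
```

Here P(W) is the language model, P(S | W) the pronunciation model (the finite state automaton), and p(A | S) the acoustic model, so the pronunciation automaton sits as one more stochastic layer between words and phones.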


Session E26 - Poster Friday - 11.10 - 12.30

Signal Analysis: Source Localisation and Beam Forming


Saruwatari H, Kawamura T, Shikano K

Nara Institute of Science and Technology, Japan

We propose a new algorithm for blind source separation (BSS), in which independent component analysis (ICA) and beamforming are combined to resolve the low-convergence problem through optimization in ICA. The proposed method consists of the following three parts: (1) frequency-domain ICA with direction-of-arrival (DOA) estimation, (2) null beamforming based on the estimated DOA, and (3) integration of (1) and (2) based on the algorithm diversity in both iteration and frequency domain. The inverse of the mixing matrix obtained by ICA is temporally substituted by the matrix based on null beamforming through iterative optimization, and the temporal alternation between ICA and beamforming can realize fast- and high-convergence optimization. The results of the signal separation experiments reveal that the signal separation performance of the proposed algorithm is superior to that of the conventional ICA-based BSS method, even under reverberant conditions. Volume 4, page 2603 Session E26
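
Part (2), null beamforming from estimated DOAs, can be sketched for a single frequency bin as below. It assumes a linear array, far-field plane waves, and as many microphones as sources. The sign convention of the steering phase and the function names are assumptions of this sketch.

```python
import numpy as np

def steering_matrix(freq, thetas, mic_pos, c=343.0):
    """Far-field steering matrix for a linear array.

    freq:    frequency in Hz
    thetas:  source directions in radians, shape (sources,)
    mic_pos: microphone positions along the array in metres, shape (mics,)
    Element [k, l] is the phase of source l at microphone k.
    """
    delays = np.outer(mic_pos, np.sin(thetas)) / c
    return np.exp(2j * np.pi * freq * delays)

def null_beamformer(freq, thetas, mic_pos, c=343.0):
    """Unmixing matrix whose row l has unit gain towards source l and a
    spatial null towards every other source (square case only)."""
    return np.linalg.inv(steering_matrix(freq, thetas, mic_pos, c))
```

In the combined scheme of the abstract, a matrix of this kind would stand in for the ICA unmixing matrix at iterations and frequencies where ICA has not yet converged. How that choice is made is not reproduced here.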

Last modified: 19.10.2011 - Contact: Dipl.-Ing. Arno Krüger