Otto-von-Guericke-Universität Magdeburg


Interspeech 2005 Lisboa


Selection: SK
Nov. 2005


Plenary Talk
Fernando C.N. Pereira
Page: 717-720, Paper Number: 3002
Abstract: Over the last few years, several groups have been developing models and algorithms for learning to predict the structure of complex data, sequences in particular, that extend well-known linear classification models and algorithms, such as logistic regression, the perceptron algorithm, and support vector machines. These methods combine the advantages of discriminative learning with those of probabilistic generative models like HMMs and probabilistic context-free grammars. I will introduce linear models for structure prediction and their simplest learning algorithms, and exemplify their benefits with applications to text and speech processing, including information extraction, parsing, and language modeling.
presented by Mam 03.11.05
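The linear structure-prediction models introduced in the talk can be illustrated with the structured (Collins-style) perceptron for sequence labeling. The sketch below is a toy example with indicator features for emissions and label bigrams; all names and the toy task are illustrative, not code from the talk.

```python
import numpy as np

def viterbi(x, W_emit, W_trans):
    """Highest-scoring label sequence under the linear model
    score(x, y) = sum_t W_emit[y_t, x_t] + sum_t W_trans[y_{t-1}, y_t]."""
    L, T = W_emit.shape[0], len(x)
    delta = np.zeros((T, L))
    back = np.zeros((T, L), dtype=int)
    delta[0] = W_emit[:, x[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + W_trans + W_emit[:, x[t]][None, :]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    y = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        y.append(int(back[t, y[-1]]))
    return y[::-1]

def train_structured_perceptron(data, n_labels, n_obs, epochs=50):
    """Structured perceptron: decode with the current weights, then push
    the weights toward the gold sequence and away from the prediction."""
    W_emit = np.zeros((n_labels, n_obs))
    W_trans = np.zeros((n_labels, n_labels))
    for _ in range(epochs):
        for x, y in data:
            z = viterbi(x, W_emit, W_trans)
            if z != y:
                for t in range(len(x)):
                    W_emit[y[t], x[t]] += 1.0
                    W_emit[z[t], x[t]] -= 1.0
                    if t > 0:
                        W_trans[y[t - 1], y[t]] += 1.0
                        W_trans[z[t - 1], z[t]] -= 1.0
    return W_emit, W_trans

# Toy task: the correct label simply copies the observation.
data = [([0, 0, 1, 1], [0, 0, 1, 1]), ([1, 0, 1, 0], [1, 0, 1, 0])]
W_e, W_t = train_structured_perceptron(data, n_labels=2, n_obs=2)
```

The same decode-and-update loop extends to richer feature functions, which is what links this family to CRFs and structured SVMs.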



Zhenchun Lei
Yingchun Yang
Zhaohui Wu
Page: 2041-2044, Paper Number: 1675
Abstract: In this paper, the mixture of support vector machines is proposed and applied to text-independent speaker recognition. The mixture of experts is used and is implemented by the divide-and-conquer approach. The purpose of adopting this idea is to deal with large-scale speech data and improve the performance of speaker recognition. The principle is to train several parallel SVMs on subsets of the whole dataset and then combine them in a distance-based or probabilistic fashion. Experiments were run on the YOHO database, and the results show that the mixture model is superior to the basic Gaussian mixture model.
presented by SK 03.11.05
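The divide-and-conquer scheme can be sketched as follows: shuffle the data, train one SVM per subset, and average the experts' margins. This is a minimal illustration, assuming a Pegasos-style linear SVM as a stand-in for the paper's SVMs and simple margin averaging for the combination step; the paper's exact combination rules differ.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=50, seed=0):
    """Pegasos-style sub-gradient training of a linear SVM (labels in {-1,+1})."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            t += 1
            eta = 1.0 / (lam * t)
            w *= (1.0 - eta * lam)               # shrink (regularization)
            if y[i] * (w @ X[i]) < 1:            # hinge-loss sub-gradient
                w += eta * y[i] * X[i]
    return w

def mixture_decision(experts, x):
    """Combine the parallel experts by averaging their margins."""
    return np.mean([w @ x for w in experts])

# Toy data: two well-separated Gaussian classes.
rng = np.random.default_rng(1)
n = 200
X = np.vstack([rng.normal(2.0, 1.0, size=(n // 2, 2)),
               rng.normal(-2.0, 1.0, size=(n // 2, 2))])
y = np.hstack([np.ones(n // 2), -np.ones(n // 2)])

# Divide: shuffle and split into four subsets; conquer: one SVM per subset.
parts = np.array_split(rng.permutation(n), 4)
experts = [train_linear_svm(X[p], y[p]) for p in parts]
pred = np.sign([mixture_decision(experts, x) for x in X])
acc = (pred == y).mean()
```

Each expert only ever sees a quarter of the data, which is the point: training cost per SVM drops while the combined decision remains accurate.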


dPLRM-Based Speaker Identification with Log Power Spectrum

Tomoko Matsui
Kunio Tanabe
Page: 2017-2020, Paper Number: 1384
Abstract: This paper investigates speaker identification with implicit extraction of speaker characteristics relevant to discrimination from the log power spectrum of training speech, by employing the inductive power of the dual Penalized Logistic Regression Machine (dPLRM). The dPLRM is a kernel method, like the support vector machine (SVM), and derives its inductive power from the mechanism of kernel regression. In text-independent speaker identification experiments with training speech uttered by 10 male speakers in three different sessions, we compare the performance of dPLRM-, SVM- and Gaussian mixture model (GMM)-based methods and show that dPLRM implicitly and effectively extracts speaker characteristics from the log power spectrum. It is also shown that dPLRM outperforms the other methods, especially when the amount of training data is small.
presented by MK 03.11.05
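The core mechanism behind dPLRM — logistic regression in a kernel-induced feature space with a penalty on the dual coefficients — can be sketched for the binary case (the paper's machine is multiclass and uses a different optimizer, so this is only an illustrative reduction):

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian RBF Gram matrix between row sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def train_kernel_logreg(K, y, lam=1e-2, lr=0.1, steps=500):
    """Penalized kernel logistic regression: minimize the logistic loss of
    f = K @ a plus the penalty lam * a' K a, by gradient descent on a."""
    a = np.zeros(len(y))
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(K @ a)))            # P(y=1 | x_i)
        grad = K @ (p - y) / len(y) + 2.0 * lam * (K @ a)
        a -= lr * grad
    return a

# Toy 2-class "speaker" data in place of log-power-spectrum frames.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.5, 0.5, size=(30, 2)),
               rng.normal(+1.5, 0.5, size=(30, 2))])
y = np.hstack([np.zeros(30), np.ones(30)])

K = rbf_kernel(X, X)
a = train_kernel_logreg(K, y)
pred = (K @ a > 0).astype(float)
acc = (pred == y).mean()
```

Unlike an SVM, the output K @ a passes through a sigmoid and so yields class posteriors directly, which is the property the paper exploits.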



Vincent Wan
James Carmichael
Page: 3321-3324, Paper Number: 1547
Abstract: This paper describes a new formulation of a polynomial sequence kernel based on dynamic time warping (DTW) for support vector machine (SVM) classification of isolated words given very sparse training data. The words are uttered by dysarthric speakers who suffer from debilitating neurological conditions that make the collection of speech samples a time-consuming and low-yield process. Data for building dysarthric speech recognition engines are therefore limited. Simulations show that the SVM based approach is significantly better than standard DTW and hidden Markov model (HMM) approaches when given sparse training data. In conditions where the models were constructed from three examples of each word, the SVM approach recorded a 45% lower error rate (relative) than the DTW approach and a 35% lower error rate than the HMM approach.
presented by EA 03.11.05
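The building block here is a kernel between whole utterances obtained from a DTW alignment cost. Below is a minimal sketch: a standard DTW recursion plus one plausible polynomial kernel built on the normalized DTW distance. The kernel form (and the names gamma, degree) are illustrative assumptions; the paper's exact formulation differs.

```python
import numpy as np

def dtw(a, b):
    """Classic DTW alignment cost between two feature sequences,
    normalized by the combined length."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)

def dtw_poly_kernel(a, b, gamma=1.0, degree=3):
    """One plausible polynomial sequence kernel on top of the DTW
    distance: small distances give similarity near 2^degree."""
    return (1.0 + np.exp(-gamma * dtw(a, b))) ** degree

# Two short 1-D feature sequences of different lengths.
a = np.array([[0.0], [1.0], [2.0]])
b = np.array([[0.0], [0.0], [1.0], [2.0]])
k_ab = dtw_poly_kernel(a, b)
```

Because DTW absorbs the length mismatch, such a kernel lets an SVM compare isolated words directly, which matters when only a handful of dysarthric training samples exist per word.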


Effects of Bayesian Predictive Classification Using Variational Bayesian Posteriors for Sparse Training Data in Speech Recognition

Shinji Watanabe
Atsushi Nakamura
Page: 1105-1108, Paper Number: 2278
Abstract: We introduce a robust classification method using the Bayesian predictive distribution (Bayesian predictive classification, referred to as BPC) into speech recognition. We and others have recently proposed a total Bayesian framework for speech recognition, Variational Bayesian Estimation and Clustering for speech recognition (VBEC). VBEC includes an analytical derivation of the approximate posterior distributions that are essential for BPC, based on variational Bayes (VB). BPC using VB posterior distributions (VB-BPC) can mitigate over-training effects by marginalizing the output distribution. We address the sparse data problem in speech recognition and show experimentally that VB-BPC is robust against data sparseness.
presented by EA 17.11.05
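The effect of marginalizing the output distribution instead of plugging in a point estimate can be shown in one dimension. For a Gaussian with unknown mean and precision under a conjugate Normal-Gamma prior, the predictive density is a Student-t; with sparse data its heavier tails avoid the overconfidence of the ML plug-in Gaussian. A minimal sketch (the hyperparameters mu0, kappa0, alpha0, beta0 are illustrative; VB-BPC applies the same idea to HMM output distributions):

```python
import math
import numpy as np

def predictive_logpdf(x, data, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
    """Student-t posterior predictive log density for a Gaussian with
    unknown mean and precision under a Normal-Gamma prior."""
    n = len(data)
    xbar = float(np.mean(data))
    S = float(np.sum((np.asarray(data) - xbar) ** 2))
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    alpha_n = alpha0 + n / 2.0
    beta_n = beta0 + S / 2.0 + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappa_n)
    df = 2.0 * alpha_n
    scale2 = beta_n * (kappa_n + 1.0) / (alpha_n * kappa_n)
    z2 = (x - mu_n) ** 2 / scale2
    return (math.lgamma((df + 1.0) / 2.0) - math.lgamma(df / 2.0)
            - 0.5 * math.log(df * math.pi * scale2)
            - (df + 1.0) / 2.0 * math.log1p(z2 / df))

def plugin_logpdf(x, data):
    """Point-estimate (ML) Gaussian log density, for comparison."""
    mu, var = float(np.mean(data)), float(np.var(data)) + 1e-12
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)

data = np.array([0.0, 0.1, -0.1])   # very sparse "training data"
```

With only three samples the plug-in variance is tiny, so an unseen point such as x = 3 gets an absurdly low plug-in score, while the predictive Student-t still assigns it reasonable mass — exactly the robustness the paper reports.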



Asela Gunawardana
Milind Mahajan
Alex Acero
John C. Platt
Page: 1117-1120, Paper Number: 1372
Abstract: In this paper, we show the novel application of hidden conditional random fields (HCRFs), conditional random fields with hidden state sequences, for modeling speech. Hidden state sequences are critical for modeling the non-stationarity of speech signals. We show that HCRFs can easily be trained using the simple direct optimization technique of stochastic gradient descent. We present the results on the TIMIT phone classification task and show that HCRFs outperform comparable ML and CML/MMI trained HMMs. In fact, HCRF results on this task are the best single-classifier results known to us. We note that the HCRF framework is easily extensible to recognition since it is a state and label sequence modeling technique. We also note that HCRFs have the ability to handle complex features without any change in training procedure.
presented by Mam 03.11.05
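The defining quantity of an HCRF is the label posterior obtained by summing over all hidden state sequences. At toy sizes this sum can be done by brute force, which makes the model easy to inspect; real systems replace the enumeration with forward recursions and train by SGD, as the paper describes. All parameter shapes below are illustrative.

```python
import itertools
import numpy as np

def hcrf_posterior(x, lam_emit, lam_trans):
    """P(label | observation sequence) for a tiny HCRF:
    P(y|x) ∝ sum over hidden state sequences s of
    exp( sum_t lam_emit[y, s_t, x_t] + sum_t lam_trans[y, s_{t-1}, s_t] )."""
    n_labels, n_states, _ = lam_emit.shape
    T = len(x)
    totals = np.zeros(n_labels)
    for y in range(n_labels):
        for s in itertools.product(range(n_states), repeat=T):
            score = sum(lam_emit[y, s[t], x[t]] for t in range(T))
            score += sum(lam_trans[y, s[t - 1], s[t]] for t in range(1, T))
            totals[y] += np.exp(score)
    return totals / totals.sum()

# Random toy parameters: 2 labels, 3 hidden states, 4 observation symbols.
rng = np.random.default_rng(0)
lam_emit = rng.normal(size=(2, 3, 4))    # (label, hidden state, observation)
lam_trans = rng.normal(size=(2, 3, 3))   # (label, state, state)
p = hcrf_posterior([0, 2, 1, 3], lam_emit, lam_trans)
```

The hidden sum is what distinguishes an HCRF from a plain CRF and what lets it model the non-stationarity within a phone, much as HMM states do.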



Viktoria Maier
Roger K. Moore
Page: 1245-1248, Paper Number: 1369
Abstract: This paper investigates a simulation of episodic memory known in the literature as 'MINERVA 2'. MINERVA 2 is a computational multiple-trace memory model that successfully predicts basic findings from the schema-abstraction literature. This model has been implemented and tested on a simple ASR task using vowel formant data taken from the Peterson & Barney database. Recognition results are compared to a number of state-of-the-art pattern classifiers, and it is shown that the episodic model achieves the best performance.
presented by Mam 17.11.05
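MINERVA 2 itself is simple enough to state in a few lines: every experience is stored as a trace; a probe activates each trace by its similarity raised to the third power (so weak matches vanish but the sign is kept), and the activation-weighted sum of traces, the "echo", carries back both an intensity and content. A minimal sketch with +/-1 features (the toy prototypes are illustrative):

```python
import numpy as np

def minerva2(probe, traces, power=3):
    """MINERVA 2 retrieval: similarity S_i of the probe to each trace,
    activation A_i = S_i**power, echo = activation-weighted trace sum."""
    sims = (traces @ probe) / probe.size      # S_i in [-1, 1] for +/-1 features
    acts = sims ** power                      # cubing keeps the sign
    intensity = acts.sum()
    echo = acts @ traces
    return intensity, echo

# Two opposite prototype patterns over 8 binary features, stored 6 and 4 times.
rng = np.random.default_rng(0)
proto_a = rng.choice([-1.0, 1.0], size=8)
proto_b = -proto_a
traces = np.vstack([proto_a] * 6 + [proto_b] * 4)

probe = proto_a.copy()
probe[0] = -probe[0]                          # a noisy version of prototype A
intensity, echo = minerva2(probe, traces)
```

Classification falls out of retrieval: the echo of a noisy probe points back toward the prototype of its class, even though no abstraction was ever computed at storage time — which is the schema-abstraction result the model is known for.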


Sorin Dusan
Larry R. Rabiner
Page: 1233-1236, Paper Number: 2177
Abstract: In spite of the effort and progress made during the last few decades, the performance of automatic speech recognition (ASR) systems still lags far behind that achieved by humans. Some researchers think that more speech data will be sufficient in order to bridge this performance gap. Others think that radical modifications to the current methods need to be made, and possible inspirations for these modifications should come from human speech perception (HSP). This paper focuses on two issues: first, it presents a comparison between HSP and ASR, emphasizing some insights from HSP that could still be applied in ASR; second, it presents some ideas for extracting useful non-linguistic information from the speech signal, the so-called 'rich transcription', which could help in selecting specialized acoustic-linguistic models that offer higher accuracy than the general models.
presented by KT 03.11.05



Roger Hsiao
Brian Mak
Page: 1797-1800, Paper Number: 2350
Abstract: Eigenvoice (EV) speaker adaptation has been shown effective for fast speaker adaptation when the amount of adaptation data is scarce. In the past two years, we have been investigating the application of kernel methods to improve EV speaker adaptation by exploiting possible nonlinearity in the speaker space, and two methods were proposed: embedded kernel eigenvoice (eKEV) and kernel eigenspace-based MLLR (KEMLLR). In both methods, kernel PCA is used to derive eigenvoices in the kernel-induced high-dimensional feature space, and they differ mainly in the representation of the speaker models. Both had been shown to outperform all other common adaptation methods when the amount of adaptation data is less than 10s. However, in the past, only small-vocabulary speech recognition tasks were tried since we were not familiar with the behaviour of these kernelized methods. As we gain more experience, we are now ready to tackle larger vocabularies. In this paper, we show that both methods continue to outperform MAP and MLLR when only 5s or 10s of adaptation data are available on the WSJ0 5K-vocabulary task. Compared with the speaker-independent model, the two methods reduce recognition word error rate by 13.4%--21.1%.
presented by MS 17.11.05
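The step shared by eKEV and KEMLLR is kernel PCA over speaker models: center the Gram matrix, eigendecompose it, and use the leading components as eigenvoices in the kernel-induced feature space. A generic kernel-PCA sketch (not the papers' full adaptation pipeline — the toy data stands in for speaker supervectors):

```python
import numpy as np

def kernel_pca(K, n_components):
    """Kernel PCA: center the Gram matrix in feature space, take the
    leading eigenvectors, and return the training-point projections."""
    n = K.shape[0]
    one = np.ones((n, n)) / n
    Kc = K - one @ K - K @ one + one @ K @ one           # double centering
    vals, vecs = np.linalg.eigh(Kc)
    idx = np.argsort(vals)[::-1][:n_components]          # largest first
    vals, vecs = vals[idx], vecs[:, idx]
    alphas = vecs / np.sqrt(np.maximum(vals, 1e-12))     # unit-norm components
    return Kc @ alphas, vals                             # projections, eigenvalues

# Toy "speaker" vectors and an RBF Gram matrix over them.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * d2)
proj, vals = kernel_pca(K, 2)
```

A new speaker is then represented by a few coordinates in this eigenspace, which is why so little adaptation data (5--10s) suffices: only those few coefficients have to be estimated.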



Niels Ole Bernsen
Laila Dybkjaer
Page: 2473-2476, Paper Number: 1342
Abstract: The Hans Christian Andersen (HCA) system is an example of a new generation of embodied conversational characters that aim to faithfully represent a familiar historical individual and carry out human-style conversation as that individual would have done had he or she lived today. A first prototype of fairytale author HCA was tested with representative users in January 2004. This paper reports on the user test of the second prototype, conducted in February 2005, focusing on the structured user interview results.
presented by KT 17.11.05



Multi-Task Learning Strategies for a Recurrent Neural Net in a Hybrid Tied-Posteriors Acoustic Model

Jan Stadermann  
Wolfram Koska
Gerhard Rigoll
Page: 2993-2996, Paper Number: 1729
Abstract: An important goal of an automatic classifier is to learn the best possible generalization from given training material. One possible improvement over a standard learning algorithm is to train several related tasks in parallel. We apply the multi-task learning scheme to a recurrent neural network estimating phoneme posterior probabilities and HMM state posterior probabilities, respectively. A comparison of networks with different additional tasks within a hybrid NN/HMM acoustic model is presented. The evaluation has been performed using the WSJ0 speaker-independent test set with a closed vocabulary of 5000 words and shows a significant improvement compared to a standard hybrid acoustic model if gender classification is used as an additional task.
presented by AW 03.11.05
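The multi-task idea is that a shared representation receives gradients from every task, so an auxiliary task (here, gender) regularizes the main one. A minimal feed-forward sketch with a shared hidden layer and two softmax heads — a stand-in for the paper's recurrent network, with all sizes, the toy data, and the auxiliary-loss weight being illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def nll(P, y):
    """Mean cross-entropy of posteriors P against integer labels y."""
    return -np.mean(np.log(P[np.arange(len(y)), y] + 1e-12))

class MultiTaskNet:
    """Shared tanh hidden layer, one softmax head per task; the shared
    weights W get gradient contributions from both heads."""
    def __init__(self, n_in, n_hid, n_main, n_aux, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0, 0.1, (n_in, n_hid))
        self.Vm = rng.normal(0, 0.1, (n_hid, n_main))
        self.Va = rng.normal(0, 0.1, (n_hid, n_aux))

    def forward(self, X):
        H = np.tanh(X @ self.W)
        return H, softmax(H @ self.Vm), softmax(H @ self.Va)

    def step(self, X, y_main, y_aux, lr=0.1, aux_weight=0.5):
        H, Pm, Pa = self.forward(X)
        Gm, Ga = Pm.copy(), Pa.copy()
        Gm[np.arange(len(X)), y_main] -= 1.0     # dCE/dlogits, main task
        Ga[np.arange(len(X)), y_aux] -= 1.0      # dCE/dlogits, auxiliary task
        dH = Gm @ self.Vm.T + aux_weight * (Ga @ self.Va.T)
        self.Vm -= lr * (H.T @ Gm) / len(X)
        self.Va -= lr * aux_weight * (H.T @ Ga) / len(X)
        self.W -= lr * (X.T @ (dH * (1.0 - H ** 2))) / len(X)

# Toy frames: 3 main classes (clusters), auxiliary label = sign of feature 0.
rng = np.random.default_rng(0)
means = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, -1.0]])
X = np.vstack([rng.normal(m, 0.4, size=(30, 2)) for m in means])
y_main = np.repeat(np.arange(3), 30)
y_aux = (X[:, 0] > 0).astype(int)

net = MultiTaskNet(2, 8, 3, 2)
_, Pm0, _ = net.forward(X)
loss_before = nll(Pm0, y_main)
for _ in range(200):
    net.step(X, y_main, y_aux)
_, Pm1, _ = net.forward(X)
loss_after = nll(Pm1, y_main)
```

At recognition time only the main head is used; the auxiliary head exists solely to shape the shared layer during training.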



Yasser Hifny    
Steve Renals 
Neil D. Lawrence
Page: 3017-3020, Paper Number: 2185
Abstract: The aim of this work is to develop a practical framework which extends the classical Hidden Markov Models (HMM) for continuous speech recognition based on the Maximum Entropy (MaxEnt) principle. The MaxEnt models can estimate the posterior probabilities directly, as with hybrid NN/HMM connectionist speech recognition systems. In particular, a new acoustic model based on discriminative MaxEnt models is formulated and is being developed to replace the generative Gaussian Mixture Models (GMM) commonly used to model acoustic variability. Initial experimental results on the TIMIT phone task are reported.
presented by SK 17.11.05
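In its simplest conditional form, a MaxEnt model over acoustic feature functions is multinomial logistic regression: it estimates P(class | frame) directly, as the abstract notes, rather than modeling P(frame | class) generatively. A minimal sketch, assuming plain gradient ascent and toy Gaussian clusters in place of acoustic frames:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def train_maxent(X, y, n_classes, lr=0.5, steps=300, l2=1e-3):
    """Conditional MaxEnt: P(y|x) = exp(w_y . x) / sum_y' exp(w_y' . x),
    fit by gradient ascent on the L2-penalized conditional log-likelihood."""
    W = np.zeros((X.shape[1], n_classes))
    onehot = np.zeros((len(y), n_classes))
    onehot[np.arange(len(y)), y] = 1.0
    for _ in range(steps):
        P = softmax(X @ W)
        # Gradient = empirical feature expectation - model expectation.
        W += lr * (X.T @ (onehot - P) / len(y) - l2 * W)
    return W

# Toy "acoustic frames": three Gaussian clusters standing in for three phones.
rng = np.random.default_rng(0)
means = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]])
X = np.vstack([rng.normal(m, 0.4, size=(30, 2)) for m in means])
Xb = np.hstack([X, np.ones((len(X), 1))])        # bias feature
y = np.repeat(np.arange(3), 30)

W = train_maxent(Xb, y, n_classes=3)
acc = (softmax(Xb @ W).argmax(axis=1) == y).mean()
```

The "empirical minus model expectation" gradient is the MaxEnt signature; the paper's contribution is embedding such posterior estimators inside the HMM framework in place of GMM likelihoods.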




Last modified: 19.10.2011 - Contact: Dipl.-Ing. Arno Krüger