News & Events

TALK Bayesian Group Sparse Learning
Date & Time: Monday, January 28, 2013; 11:00 AM
Speaker: Prof. Jen-Tzung Chien, National Chiao Tung University, Taiwan
Research Area: Speech & Audio
Abstract
- Bayesian learning provides attractive tools to model, analyze, search, recognize and understand real-world data. In this talk, I will introduce a new Bayesian group sparse learning approach and its application to speech recognition and signal separation. First, I present group sparse hidden Markov models (GS-HMMs), in which a sequence of acoustic features is driven by a Markov chain and each feature vector is represented by two groups of basis vectors, which represent the features across states and within states, respectively. The sparse prior is imposed through the Laplacian scale mixture (LSM) distribution. The robustness of speech recognition is illustrated. The LSM distribution is also incorporated into Bayesian group sparse learning based on nonnegative matrix factorization (NMF). This approach is developed to estimate the reconstructed rhythmic and harmonic music signals from a single-channel source signal. A Monte Carlo procedure is presented to infer the two groups of parameters. Future work on Bayesian learning will also be discussed.
TALK Speech recognition for closed-captioning
Date & Time: Tuesday, December 11, 2012; 12:00 PM
Speaker: Takahiro Oku, NHK Science & Technology Research Laboratories
Research Area: Speech & Audio
Abstract
- In this talk, I will present human-friendly broadcasting research conducted at NHK and research on speech recognition for real-time closed-captioning. The goal of human-friendly broadcasting research is to make broadcasting more accessible and enjoyable for everyone, including children, the elderly, and physically challenged persons. The automatic speech recognition technology that NHK has developed makes it possible to create captions for the hearing impaired automatically and in real time. For sports programs such as professional sumo wrestling, a closed-captioning system has already been implemented in which captions are created by applying speech recognition to the speech of a captioning re-speaker. In 2011, NHK General Television started broadcasting closed captions for the information program "Morning Market". After introducing the implemented closed-captioning system, I will talk about a recent improvement obtained with an adaptation method that creates a more effective acoustic model using error correction results. The method reflects recognition error tendencies more effectively.
NEWS APSIPA Transactions on Signal and Information Processing: publication by Shinji Watanabe and others
Date: December 6, 2012
Where: APSIPA Transactions on Signal and Information Processing
Research Area: Speech & Audio
Brief
- The article "Bayesian Approaches to Acoustic Modeling: A Review" by Watanabe, S. and Nakamura, A. was published in APSIPA Transactions on Signal and Information Processing.
NEWS Techniques for Noise Robustness in Automatic Speech Recognition: publication by Jonathan Le Roux, John R. Hershey and others
Date: November 28, 2012
Where: Techniques for Noise Robustness in Automatic Speech Recognition
MERL Contact: Jonathan Le Roux
Research Area: Speech & Audio
Brief
- The article "Factorial Models for Noise Robust Speech Recognition" by Hershey, J.R., Rennie, S.J. and Le Roux, J. was published in the book Techniques for Noise Robustness in Automatic Speech Recognition.
NEWS IEEE Signal Processing Magazine: publication by Shinji Watanabe and others
Date: November 1, 2012
Where: IEEE Signal Processing Magazine
Research Area: Speech & Audio
Brief
- The article "Structured Discriminative Models For Speech Recognition" by Gales, M., Watanabe, S. and Fosler-Lussier, E. was published in IEEE Signal Processing Magazine.
TALK Recognizing and Classifying Environmental Sounds
Date & Time: Wednesday, October 24, 2012; 11:00 AM
Speaker: Prof. Dan Ellis, Columbia University
MERL Host: Jonathan Le Roux
Research Area: Speech & Audio
TALK Advances in Acoustic Modeling at IBM Research: Deep Belief Networks, Sparse Representations
Date & Time: Wednesday, October 24, 2012; 9:55 AM
Speaker: Dr. Tara Sainath, IBM Research
MERL Host: Jonathan Le Roux
Research Area: Speech & Audio
TALK Factorial Hidden Restricted Boltzmann Machines for Noise Robust Speech Recognition
Date & Time: Wednesday, October 24, 2012; 3:20 PM
Speaker: Dr. Steven J. Rennie, IBM Research
MERL Host: Jonathan Le Roux
Research Area: Speech & Audio
TALK A new class of dynamical system models for speech and audio
Date & Time: Wednesday, October 24, 2012; 4:05 PM
Speaker: Dr. John R. Hershey, MERL
MERL Host: Jonathan Le Roux
Research Area: Speech & Audio
EVENT SANE 2012 - Speech and Audio in the Northeast
Date & Time: Wednesday, October 24, 2012; 8:30 AM - 5:00 PM
Location: MERL
MERL Contact: Jonathan Le Roux
Research Area: Speech & Audio
Brief
- SANE 2012, a one-day event gathering researchers and students in speech and audio from the northeast of the American continent, will be held on Wednesday October 24, 2012 at Mitsubishi Electric Research Laboratories (MERL) in Cambridge, MA.
TALK Understanding Audition via Sound Analysis and Synthesis
Date & Time: Wednesday, October 24, 2012; 11:45 AM
Speaker: Josh McDermott, MIT, BCS
MERL Host: Jonathan Le Roux
Research Area: Speech & Audio
TALK Self-Organizing Units (SOUs): Training Speech Recognizers Without Any Transcribed Audio
Date & Time: Wednesday, October 24, 2012; 2:15 PM
Speaker: Dr. Herb Gish, BBN - Raytheon
MERL Host: Jonathan Le Roux
Research Area: Speech & Audio
TALK Latent Topic Modeling of Conversational Speech
Date & Time: Wednesday, October 24, 2012; 1:30 PM
Speaker: Dr. Timothy J. Hazen and David Harwath, MIT Lincoln Labs / MIT CSAIL
MERL Host: Jonathan Le Roux
Research Area: Speech & Audio
TALK Zero-Resource Speech Pattern and Sub-Word Unit Discovery
Date & Time: Wednesday, October 24, 2012; 9:10 AM
Speaker: Prof. Jim Glass and Chia-ying Lee, MIT CSAIL
MERL Host: Jonathan Le Roux
Research Area: Speech & Audio
NEWS HFES 2012: publication by Bret A. Harsham and others
Date: October 22, 2012
Where: Annual Meeting of the Human Factors and Ergonomics Society (HFES)
Research Area: Speech & Audio
Brief
- The paper "Evaluation of Two Types of In-Vehicle Music Retrieval and Navigation Systems" by Zhang, J., Borowsky, A., Schmidt-Nielsen, B., Harsham, B., Weinberg, G., Romoser, M.R.E. and Fisher, D.L. was presented at the Annual Meeting of the Human Factors and Ergonomics Society (HFES).
TALK Non-negative Hidden Markov Modeling of Audio
Date & Time: Thursday, October 11, 2012; 2:30 PM
Speaker: Dr. Gautham J. Mysore, Adobe
Research Area: Speech & Audio
Abstract
- Non-negative spectrogram factorization techniques have become quite popular in the last decade, as they are effective in modeling the spectral structure of audio. They have been used extensively for applications such as source separation and denoising. These techniques, however, fail to account for non-stationarity and temporal dynamics, which are two important properties of audio. In this talk, I will introduce the non-negative hidden Markov model (N-HMM) and the non-negative factorial hidden Markov model (N-FHMM) to model single sound sources and sound mixtures, respectively. They jointly model the spectral structure and temporal dynamics of sound sources, while accounting for non-stationarity. I will also discuss the use of these models in applications such as source separation, denoising, and content-based audio processing, showing why they yield improved performance compared to non-negative spectrogram factorization techniques.
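For readers unfamiliar with the baseline mentioned in this abstract, the following is a minimal sketch of plain non-negative matrix factorization with the standard multiplicative updates for the Euclidean cost. It is not the N-HMM or N-FHMM from the talk. The function name `nmf`, the toy "spectrogram" and all parameter values are assumptions made for illustration only.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (freq x frames) as W @ H.

    Illustrative helper (hypothetical name): multiplicative updates
    for the Euclidean cost, starting from a random non-negative init.
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + eps    # spectral basis vectors
    H = rng.random((rank, n_frames)) + eps  # per-frame activations
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so W and H
        # stay non-negative throughout.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy stand-in for a magnitude spectrogram: 16 bins, 40 frames, rank 2.
rng = np.random.default_rng(1)
V = rng.random((16, 2)) @ rng.random((2, 40))

W, H = nmf(V, rank=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print("relative reconstruction error:", rel_err)
```

In this sketch every frame (column of `H`) is fitted without reference to its neighbors, so shuffling the frames of `V` would leave the cost unchanged. That is one way to see the lack of temporal modeling that the abstract points out.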
TALK Tensor representation of speaker space for arbitrary speaker conversion
Date & Time: Thursday, September 6, 2012; 12:00 PM
Speaker: Dr. Daisuke Saito, The University of Tokyo
Research Area: Speech & Audio
Abstract
- In voice conversion studies, conversion from and to an arbitrary speaker's voice is one of the important objectives. For this purpose, eigenvoice conversion (EVC) based on an eigenvoice Gaussian mixture model (EV-GMM) was proposed. In EVC, similarly to speaker recognition approaches, a speaker space is constructed from GMM supervectors, which are high-dimensional vectors derived by concatenating the mean vectors of each of the speaker GMMs. In the speaker space, each speaker is represented by a small number of weight parameters of eigen-supervectors. In this talk, we revisit the construction of the speaker space by introducing tensor analysis of the training data set. In our approach, each speaker is represented as a matrix whose rows and columns correspond, respectively, to the Gaussian components and the dimensions of the mean vector, and the speaker space is derived by tensor analysis of the set of matrices. Our approach can solve an inherent problem of the supervector representation, and it improves the performance of voice conversion. Experimental results of one-to-many voice conversion demonstrate the effectiveness of the proposed approach.
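The two data layouts contrasted in this abstract can be sketched as follows. The sizes, the random stand-in for per-speaker GMM means, and all variable names are assumptions for illustration. The abstract does not say which tensor decomposition is used, so the sketch stops at building the speaker tensor and its unfoldings.

```python
import numpy as np

# Hypothetical sizes: 5 training speakers, GMMs with 4 components,
# 3-dimensional mean vectors. Random values stand in for trained means.
n_speakers, n_components, dim = 5, 4, 3
rng = np.random.default_rng(0)
means = rng.normal(size=(n_speakers, n_components, dim))

# Supervector layout: concatenate the mean vectors of each speaker GMM.
supervectors = means.reshape(n_speakers, n_components * dim)

# Eigen-supervectors via PCA (SVD of the centered supervectors);
# each speaker is then described by a few weights.
bias = supervectors.mean(axis=0)
_, _, Vt = np.linalg.svd(supervectors - bias, full_matrices=False)
n_eigen = 2
basis = Vt[:n_eigen]                        # (n_eigen, components * dim)
weights = (supervectors - bias) @ basis.T   # (speakers, n_eigen)

# Matrix layout: each speaker is a (components x dim) matrix, and the
# training set is a 3-way tensor (speakers x components x dim).
tensor = means

# Unfoldings of the tensor along the component and dimension modes,
# the kind of matrices a tensor analysis would operate on.
unfold_components = np.moveaxis(tensor, 1, 0).reshape(n_components, -1)
unfold_dims = np.moveaxis(tensor, 2, 0).reshape(dim, -1)

print(supervectors.shape, weights.shape, tensor.shape)
```

The supervector layout flattens the component and dimension axes into a single axis, whereas the matrix layout keeps them separate so that each can be analyzed as its own mode.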
NEWS IWSML 2012: publication by Jonathan Le Roux, John R. Hershey and others
Date: March 31, 2012
Where: International Workshop on Statistical Machine Learning for Speech Processing (IWSML)
MERL Contact: Jonathan Le Roux
Research Area: Speech & Audio
Brief
- The paper "Latent Dirichlet Reallocation for Term Swapping" by Heaukulani, C., Le Roux, J. and Hershey, J.R. was presented at the International Workshop on Statistical Machine Learning for Speech Processing (IWSML).
NEWS ASJ 2012: publication by Jonathan Le Roux and John R. Hershey
Date: March 13, 2012
Where: Acoustical Society of Japan Spring Meeting (ASJ)
MERL Contact: Jonathan Le Roux
Research Area: Speech & Audio
Brief
- The paper "Speech Enhancement by Indirect VTS" by Le Roux, J. and Hershey, J.R. was presented at the Acoustical Society of Japan Spring Meeting (ASJ).
TALK Learning Intermediate-Level Representations of Form and Motion from Natural Movies
Date & Time: Wednesday, February 22, 2012; 11:00 AM
Speaker: Dr. Charles Cadieu, McGovern Institute for Brain Research, MIT
MERL Host: Jonathan Le Roux
Research Area: Speech & Audio
Abstract
- The human visual system processes complex patterns of light into a rich visual representation where the objects and motions of our world are made explicit. This remarkable feat is performed through a hierarchically arranged series of cortical areas. Little is known about the details of the representations in the intermediate visual areas. Therefore, we ask the question: can we predict the detailed structure of the representations we might find in intermediate visual areas?
  
  In pursuit of this question, I will present a model of intermediate-level visual representation that is based on learning invariances from movies of the natural environment and produces predictions about intermediate visual areas. The model is composed of two stages of processing: an early feature representation layer, and a second layer in which invariances are explicitly represented. Invariances are learned as the result of factoring apart the temporally stable and dynamic components embedded in the early feature representation. The structure contained in these components is made explicit in the activities of second-layer units that capture invariances in both form and motion. When trained on natural movies, the first layer produces a factorization, or separation, of image content into a temporally persistent part representing local edge structure and a dynamic part representing local motion structure. The second-layer units are split into two populations according to the factorization in the first layer. The form-selective units receive their input from the temporally persistent part (local edge structure) and, after training, yield a diverse set of higher-order shape features consisting of extended contours, multi-scale edges, textures, and texture boundaries. The motion-selective units receive their input from the dynamic part (local motion structure) and, after training, yield a representation of image translation over different spatial scales and directions, in addition to more complex deformations. These representations provide a rich description of dynamic natural images, provide testable hypotheses regarding intermediate-level representation in visual cortex, and may be useful representations for artificial visual systems.
EVENT Audio and Music Signal Processing Mini-Symposium
Date & Time: Thursday, October 20, 2011; 2:00 PM - 5:00 PM
Location: MERL
MERL Contact: Jonathan Le Roux
Research Area: Speech & Audio
Brief
- MERL is hosting a mini-symposium on audio and music signal processing, with three talks by eminent researchers in the field: Prof. Mark Plumbley, Dr. Cedric Fevotte and Prof. Nobutaka Ono.
TALK Itakura-Saito nonnegative matrix factorization and friends for music signal decomposition
Date & Time: Thursday, October 20, 2011; 3:00 PM
Speaker: Dr. Cedric Fevotte, CNRS - Telecom ParisTech, Paris
MERL Host: Jonathan Le Roux
Research Area: Speech & Audio
TALK Auxiliary Function Approach to Source Localization and Separation
Date & Time: Thursday, October 20, 2011; 3:40 PM
Speaker: Prof. Nobutaka Ono, National Institute of Informatics, Tokyo
MERL Host: Jonathan Le Roux
Research Area: Speech & Audio
TALK Analysing Digital Music
Date & Time: Thursday, October 20, 2011; 2:20 PM
Speaker: Prof. Mark Plumbley, Queen Mary, London
MERL Host: Jonathan Le Roux
Research Area: Speech & Audio
NEWS International Driving Symposium on Human Factors in Driver Assessment, Training and Vehicle Design 2011: publication by Bret A. Harsham and others
Date: June 27, 2011
Where: International Driving Symposium on Human Factors in Driver Assessment, Training and Vehicle Design
Research Area: Speech & Audio
Brief
- The paper "Investigating HUDs for the Presentation of Choice Lists in Car Navigation Systems" by Weinberg, G., Harsham, B. and Medenica, Z. was presented at the International Driving Symposium on Human Factors in Driver Assessment, Training and Vehicle Design.