Yoshiki Masuyama

  • Email: masuyama[at]merl[dot]com
  • Biography

    Yoshiki's research interests focus on the integration of signal processing and machine learning technologies for efficient and robust audio processing. He has worked on a wide range of audio signal processing tasks, especially multichannel speech separation, robust automatic speech recognition, and multimodal learning. He is the recipient of the Best Student Paper Award at the IEEE Spoken Language Technology Workshop 2022.

  • Awards

    •  AWARD    MERL team wins the Listener Acoustic Personalisation (LAP) 2024 Challenge
      Date: August 29, 2024
      Awarded to: Yoshiki Masuyama, Gordon Wichern, François G. Germain, Christopher Ick, and Jonathan Le Roux
      MERL Contacts: François Germain; Jonathan Le Roux; Gordon Wichern; Yoshiki Masuyama
      Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
      Brief
      • MERL's Speech & Audio team ranked 1st out of 7 teams in Task 2 of the 1st SONICOM Listener Acoustic Personalisation (LAP) Challenge, which focused on "Spatial upsampling for obtaining a high-spatial-resolution HRTF from a very low number of directions". The team was led by Yoshiki Masuyama and also included Gordon Wichern, François Germain, MERL intern Christopher Ick, and Jonathan Le Roux.

        The LAP Challenge workshop and award ceremony was hosted by the 32nd European Signal Processing Conference (EUSIPCO 24) on August 29, 2024 in Lyon, France. Yoshiki Masuyama presented the team's method, "Retrieval-Augmented Neural Field for HRTF Upsampling and Personalization", and received the award from Prof. Michele Geronazzo (University of Padova, IT, and Imperial College London, UK), Chair of the Challenge's Organizing Committee.

        The LAP Challenge addresses open problems in personalized spatial audio, with the first edition focusing on the spatial upsampling and interpolation of head-related transfer functions (HRTFs). HRTFs with dense spatial grids are required for immersive audio experiences, but recording them is time-consuming. Although HRTF spatial upsampling has recently shown remarkable progress with approaches involving neural fields, HRTF estimation accuracy remains limited when upsampling from only a few measured directions, e.g., 3 or 5 measurements. The MERL team tackled this problem by proposing a retrieval-augmented neural field (RANF). RANF retrieves, from a library of subjects, a subject whose HRTFs are close to those of the target subject at the measured directions. The HRTF of the retrieved subject at the target direction is fed into the neural field in addition to the desired sound source direction. The team also developed a neural network architecture that can handle an arbitrary number of retrieved subjects, inspired by a multi-channel processing technique called transform-average-concatenate.
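The retrieval step described above can be sketched roughly as follows. This is a toy illustration rather than the team's implementation: the distance metric (mean squared difference), the function names `retrieve_subjects` and `average_and_concatenate`, and all data values are assumptions made for the example, and the learned neural field itself is omitted. The averaging helper only mimics the average-and-concatenate idea behind transform-average-concatenate, without any learned transforms.

```python
from typing import Dict, List, Tuple

Direction = Tuple[float, float]          # (azimuth, elevation) in degrees
HRTFSet = Dict[Direction, List[float]]   # direction -> toy magnitude response


def spectral_distance(a: List[float], b: List[float]) -> float:
    """Mean squared difference between two responses (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def retrieve_subjects(target: HRTFSet, library: Dict[str, HRTFSet], k: int = 1) -> List[str]:
    """Return the k library subjects closest to the target at its measured directions."""
    def score(subject_id: str) -> float:
        subject = library[subject_id]
        return sum(spectral_distance(target[d], subject[d]) for d in target) / len(target)
    return sorted(library, key=score)[:k]


def average_and_concatenate(features: List[List[float]]) -> List[List[float]]:
    """Append the across-subject mean to each subject's feature vector."""
    n = len(features)
    mean = [sum(column) / n for column in zip(*features)]
    return [f + mean for f in features]


# Hypothetical library: three subjects, each measured at three directions.
library = {
    "subject_a": {(0.0, 0.0): [1.0, 0.5], (90.0, 0.0): [0.8, 0.4], (180.0, 0.0): [0.6, 0.3]},
    "subject_b": {(0.0, 0.0): [0.2, 0.9], (90.0, 0.0): [0.1, 0.7], (180.0, 0.0): [0.3, 0.8]},
    "subject_c": {(0.0, 0.0): [0.8, 0.7], (90.0, 0.0): [0.6, 0.6], (180.0, 0.0): [0.5, 0.2]},
}
# The target subject is only measured at two directions.
target_measurements = {(0.0, 0.0): [1.0, 0.6], (90.0, 0.0): [0.8, 0.5]}

retrieved = retrieve_subjects(target_measurements, library, k=2)
query_direction = (180.0, 0.0)  # unmeasured direction to be estimated
retrieved_hrtfs = [library[s][query_direction] for s in retrieved]
# These vectors, together with query_direction, would condition the neural field.
conditioning = average_and_concatenate(retrieved_hrtfs)
```

Because the mean is computed across however many subjects are retrieved, the same helper works for any `k`, which is the property the brief attributes to the team's architecture.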
    •  AWARD    MERL team wins the Audio-Visual Speech Enhancement (AVSE) 2023 Challenge
      Date: December 16, 2023
      Awarded to: Zexu Pan, Gordon Wichern, Yoshiki Masuyama, François Germain, Sameer Khurana, Chiori Hori, and Jonathan Le Roux
      MERL Contacts: François Germain; Chiori Hori; Sameer Khurana; Jonathan Le Roux; Gordon Wichern; Yoshiki Masuyama
      Research Areas: Artificial Intelligence, Machine Learning, Speech & Audio
      Brief
      • MERL's Speech & Audio team ranked 1st out of 12 teams in the 2nd COG-MHEAR Audio-Visual Speech Enhancement Challenge (AVSE). The team was led by Zexu Pan and also included Gordon Wichern, Yoshiki Masuyama, François Germain, Sameer Khurana, Chiori Hori, and Jonathan Le Roux.

        The AVSE challenge aims to design better speech enhancement systems by harnessing the visual aspects of speech (such as lip movements and gestures) in a manner similar to the brain’s multi-modal integration strategies. MERL’s system was a scenario-aware audio-visual TF-GridNet that incorporates the face recording of a target speaker as a conditioning factor and also recognizes whether the predominant interference signal is speech or background noise. In addition to outperforming all competing systems by a wide margin on objective metrics, MERL’s model achieved the best overall word intelligibility score in a listening test: 84.54%, compared to 57.56% for the baseline and 80.41% for the next best team. Fisher’s least significant difference (LSD) was 2.14%, indicating that MERL’s model offered statistically significant speech intelligibility improvements over all other systems.
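The significance claim above follows from comparing each score gap with the LSD. The short check below uses only the three scores quoted in the brief; the variable names are illustrative, and the scores of the remaining teams are not given here, so they are not included.

```python
# Word intelligibility scores (%) quoted in the brief.
scores = {"MERL": 84.54, "next best team": 80.41, "baseline": 57.56}
lsd = 2.14  # Fisher's least significant difference (%), as quoted

# Gap between MERL's score and each other quoted system.
gaps = {
    name: round(scores["MERL"] - score, 2)
    for name, score in scores.items()
    if name != "MERL"
}

# A gap larger than the LSD is treated as statistically significant.
all_significant = all(gap > lsd for gap in gaps.values())
print(gaps, all_significant)
```

The smallest quoted gap, 4.13 points over the next best team, is nearly twice the LSD of 2.14.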
  • MERL Publications

    •  Masuyama, Y., Wichern, G., Germain, F.G., Pan, Z., Khurana, S., Hori, C., Le Roux, J., "NIIRF: Neural IIR Filter Field for HRTF Upsampling and Personalization", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), DOI: 10.1109/ICASSP48485.2024.10448477, March 2024, pp. 1016-1020.
      TR2024-026
      @inproceedings{Masuyama2024mar,
        author = {Masuyama, Yoshiki and Wichern, Gordon and Germain, François G and Pan, Zexu and Khurana, Sameer and Hori, Chiori and Le Roux, Jonathan},
        title = {NIIRF: Neural IIR Filter Field for HRTF Upsampling and Personalization},
        booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
        year = 2024,
        pages = {1016--1020},
        month = mar,
        doi = {10.1109/ICASSP48485.2024.10448477},
        url = {https://www.merl.com/publications/TR2024-026}
      }
    •  Pan, Z., Wichern, G., Masuyama, Y., Germain, F.G., Khurana, S., Hori, C., Le Roux, J., "Scenario-Aware Audio-Visual TF-GridNet for Target Speech Extraction", IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), DOI: 10.1109/ASRU57964.2023.10389618, December 2023.
      TR2023-152
      @inproceedings{Pan2023dec2,
        author = {Pan, Zexu and Wichern, Gordon and Masuyama, Yoshiki and Germain, François G and Khurana, Sameer and Hori, Chiori and Le Roux, Jonathan},
        title = {Scenario-Aware Audio-Visual TF-GridNet for Target Speech Extraction},
        booktitle = {IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)},
        year = 2023,
        month = dec,
        doi = {10.1109/ASRU57964.2023.10389618},
        isbn = {979-8-3503-0689-7},
        url = {https://www.merl.com/publications/TR2023-152}
      }