TR2025-037
No Class Left Behind: A Closer Look at Class Balancing for Audio Tagging
Ebbers, J., Germain, F.G., Wilkinghoff, K., Wichern, G., Le Roux, J., "No Class Left Behind: A Closer Look at Class Balancing for Audio Tagging", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), March 2025.
@inproceedings{Ebbers2025mar,
  author = {Ebbers, Janek and Germain, Fran\c{c}ois G. and Wilkinghoff, Kevin and Wichern, Gordon and {Le Roux}, Jonathan},
  title = {{No Class Left Behind: A Closer Look at Class Balancing for Audio Tagging}},
  booktitle = {IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)},
  year = 2025,
  month = mar,
  url = {https://www.merl.com/publications/TR2025-037}
}
Abstract:
Large-scale audio tagging datasets like AudioSet usually suffer from severe class imbalance, comprising many audio examples for common sound classes but only a few examples of rare sound classes. The latter, however, may be equally or even more important to recognize. It is therefore common practice to sample examples from rare classes more frequently during training. At the same time, the effects of such balancing on a model's training and tagging performance are still not well understood. In this work, we investigate how it affects training convergence and tagging performance. We consider varying degrees of balancing and investigate whether classes converge simultaneously or whether there is a benefit to selecting a different balancing rate for each class. Furthermore, we investigate data-efficient oversampling, which keeps audio files from rare classes in memory and repeats them in close succession over multiple batches, minimizing data loading from disk. Finally, we show that for AudioSet, the optimal amount of class balancing differs when fine-tuning a model pre-trained via self-supervised learning versus training a supervised model from scratch.
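To make the idea of class-balanced sampling concrete, the minimal sketch below (not the authors' code) draws training examples with rare-class examples sampled more frequently. The function name `example_weights`, the balancing exponent `lam`, and the toy label matrix are illustrative assumptions; `lam = 0` corresponds to unbalanced random sampling and `lam = 1` to fully class-balanced sampling, with intermediate values giving the "varying degrees of balancing" studied in the paper.

```python
# Hedged sketch of class-balanced example sampling for multi-label audio tagging.
# All names and the balancing scheme p(class) ~ count^(-lam) are assumptions for
# illustration, not the exact method used in the paper.
import numpy as np

def example_weights(labels, lam=1.0):
    """labels: binary matrix of shape (num_examples, num_classes)."""
    labels = np.asarray(labels, dtype=float)
    class_counts = labels.sum(axis=0)                      # examples per class
    class_weights = (1.0 / np.maximum(class_counts, 1.0)) ** lam
    # An example's weight aggregates the weights of its (possibly multiple) labels.
    weights = labels @ class_weights
    return weights / weights.sum()

# Usage: draw training indices so that the rare class (class 1) is oversampled.
labels = np.array([[1, 0], [1, 0], [1, 0], [0, 1]])
probs = example_weights(labels, lam=1.0)
rng = np.random.default_rng(0)
batch_indices = rng.choice(len(labels), size=8, p=probs, replace=True)
```

The data-efficient oversampling described in the abstract could then reuse the decoded audio of such oversampled rare-class files across nearby batches instead of re-reading them from disk each time they are drawn.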