

Comparative Study of Audio Features for Intelligent Sound Recognition
Abstract
The accurate classification of environmental and urban sounds is an important problem due to its relevance in applications such as surveillance, smart cities, healthcare, and multimedia retrieval. This work explores how different audio feature representations affect classification accuracy in machine learning systems. The proposed solution employs a Convolutional Neural Network (CNN) to classify urban sound events from various audio feature representations. The main research contribution lies in the comparative analysis of multiple audio features and their effectiveness in improving CNN-based classification. This study uses the UrbanSound8K dataset, which contains 10 common urban sound classes. Key audio features, namely MFCCs, log-Mel spectrograms, and Chroma features, are extracted and used to train separate CNN models, and the performance of each feature type is then evaluated and compared. Results show that MFCCs and log-Mel spectrograms significantly outperform Chroma features in classification accuracy. MFCCs provide a compact and informative representation of audio signals, while log-Mel spectrograms capture time-frequency characteristics effectively. Models trained with these features exhibit robust performance across varied acoustic conditions, indicating that appropriate feature selection can lead to improved model generalization and efficiency. In conclusion, this study emphasizes the critical role of audio feature selection in urban sound classification and provides practical guidelines for developing efficient machine listening systems.
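
As a rough illustration of the feature-extraction step described above, the sketch below computes MFCC, log-Mel spectrogram, and Chroma representations for a single clip using the librosa library. The parameter choices (40 MFCCs, 128 Mel bands, 12 chroma bins) and the example file path are assumptions made for illustration only, not the authors' exact configuration.

```python
# Minimal sketch of the three feature types compared in the study, using librosa.
# Parameter values and the file path are illustrative assumptions.
import numpy as np
import librosa


def extract_features(wav_path: str, sr: int = 22050) -> dict:
    """Load one audio clip and compute MFCC, log-Mel, and Chroma features."""
    y, sr = librosa.load(wav_path, sr=sr, mono=True)

    # MFCCs: compact cepstral summary of the spectral envelope.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)

    # Log-Mel spectrogram: time-frequency representation converted to decibels.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    log_mel = librosa.power_to_db(mel, ref=np.max)

    # Chroma: spectral energy folded into 12 pitch classes.
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_chroma=12)

    return {"mfcc": mfcc, "log_mel": log_mel, "chroma": chroma}


if __name__ == "__main__":
    # Hypothetical path to a clip from one UrbanSound8K fold.
    feats = extract_features("UrbanSound8K/audio/fold1/example_clip.wav")
    for name, mat in feats.items():
        # Each matrix is (n_bands, n_frames); frame count depends on clip length.
        print(name, mat.shape)
```

Each of the three resulting matrices can then be used as the input representation for a separate CNN model, as done in the comparative evaluation.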