

Integrating lip dynamics into a visual speech framework
Abstract
Visual Speech Recognition (VSR) is a rapidly evolving field with diverse applications in human-computer interaction, accessibility, and security. This paper presents an approach to VSR that focuses on the extraction and analysis of lip movements for speech recognition. Traditional speech recognition systems rely primarily on acoustic information, making them vulnerable to noisy environments and audio disturbances. In contrast, the proposed method leverages the visual modality by harnessing the rich information encoded in lip movements during speech production. The study begins by collecting a comprehensive dataset of visual and audio recordings of speech across various languages and contexts. A deep learning architecture is then designed to process the visual data, with emphasis on lip movements, together with the corresponding audio data. The proposed model integrates convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to extract and fuse information from both modalities. This fusion enhances the robustness of the system by mitigating the limitations of audio-only speech recognition. We evaluate the performance of the visual speech recognition system on a range of benchmark datasets and real-world scenarios. The results demonstrate the efficacy of the approach, highlighting its capacity to improve recognition accuracy, particularly in noisy environments or in situations where audio data is incomplete or unavailable. In conclusion, this research contributes to the advancement of Visual Speech Recognition by introducing an approach centred on lip movement analysis. By leveraging both audio and visual modalities, the proposed system provides a more robust and versatile solution for speech recognition, with the potential to enhance applications in human-computer interaction, accessibility, and security.
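The abstract does not give implementation details, so the following is only a minimal sketch, assuming PyTorch, of how a per-frame CNN over cropped lip regions can be combined with an RNN and fused with an audio feature stream. All module names, layer sizes, and the late-fusion strategy are illustrative assumptions, not the authors' architecture.

```python
# Minimal sketch (assumption: PyTorch) of a CNN + RNN audio-visual fusion model.
# Layer sizes, input shapes, and the late-fusion design are illustrative choices,
# not details taken from the paper.
import torch
import torch.nn as nn


class VisualFrontend(nn.Module):
    """Per-frame 2D CNN over cropped lip regions (T frames of 1x64x64)."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, x):                           # x: (B, T, 1, 64, 64)
        b, t = x.shape[:2]
        f = self.cnn(x.flatten(0, 1)).flatten(1)    # (B*T, 64)
        return self.proj(f).view(b, t, -1)          # (B, T, feat_dim)


class AudioVisualASR(nn.Module):
    """Late fusion: concatenate visual and audio features per time step, then a GRU."""
    def __init__(self, audio_dim=40, feat_dim=128, hidden=256, vocab=40):
        super().__init__()
        self.visual = VisualFrontend(feat_dim)
        self.audio_proj = nn.Linear(audio_dim, feat_dim)
        self.rnn = nn.GRU(2 * feat_dim, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, vocab)

    def forward(self, frames, audio_feats):
        # frames: (B, T, 1, 64, 64) lip crops; audio_feats: (B, T, audio_dim), e.g. MFCCs
        v = self.visual(frames)
        a = self.audio_proj(audio_feats)
        fused, _ = self.rnn(torch.cat([v, a], dim=-1))
        return self.classifier(fused)               # per-frame logits over the vocabulary


if __name__ == "__main__":
    model = AudioVisualASR()
    frames = torch.randn(2, 75, 1, 64, 64)          # 2 clips, 75 frames each
    audio = torch.randn(2, 75, 40)
    print(model(frames, audio).shape)               # torch.Size([2, 75, 40])
```

In practice, per-frame logits of this kind are typically trained with a sequence loss such as CTC, which is one common choice for sentence-level lip reading; the fusion could also be performed earlier (feature-level) or later (decision-level) depending on the design.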