

Feelings in Pixels: Exploring Multimodal Sentiments
Abstract
Traditional sentiment analysis has largely focused on text, often missing the deeper emotional context found in audio, images, and video. This paper introduces a multimodal sentiment analysis framework that brings together text, speech, images, and video to capture a fuller picture of human emotions. For text, the system uses VADER (Valence Aware Dictionary and sEntiment Reasoner); for images, it employs Convolutional Neural Networks (CNNs) to recognize facial expressions; and for audio and video, it leverages real-time processing techniques to detect dynamic emotional shifts. The proposed approach achieved an overall accuracy of 80%, with strong precision and recall across different input types. Still, challenges such as class imbalance (e.g., far more "Happy" samples than "Disgust") and overfitting (with validation accuracy peaking at 60%) point to areas needing refinement. The potential applications of this framework are wide-ranging, from improving customer engagement to supporting mental health monitoring and enhancing social media insights. Looking ahead, future research will investigate transformer-based models and address the ethical implications of real-time emotion analysis.
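
To make the pipeline described above concrete, the sketch below shows one way the text and image branches could be combined: VADER scores the text, a small CNN (FER-2013-style 48x48 grayscale input) classifies facial expressions, and the two are merged by late fusion. The abstract does not specify the fusion rule, the emotion-to-valence mapping, or the network architecture, so those details (including the `EMOTION_VALENCE` table and the 0.5 text weight) are illustrative assumptions, not the authors' implementation.

```python
# Minimal late-fusion sketch: VADER for text, a small Keras CNN for faces.
# The fusion weighting and valence mapping are assumptions for illustration.
import numpy as np
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from tensorflow import keras
from tensorflow.keras import layers

EMOTIONS = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]
# Rough valence assumed for each facial class so it can be fused with
# VADER's compound score (range -1..1).
EMOTION_VALENCE = np.array([-0.8, -0.7, -0.6, 0.9, -0.8, 0.4, 0.0])

def build_face_cnn(input_shape=(48, 48, 1), n_classes=len(EMOTIONS)):
    """Small CNN for 48x48 grayscale face crops (FER-2013-style input)."""
    return keras.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])

def fuse_valence(text, face_batch, cnn, text_weight=0.5):
    """Weighted average of text valence (VADER) and image valence (CNN)."""
    text_valence = SentimentIntensityAnalyzer().polarity_scores(text)["compound"]
    probs = cnn.predict(face_batch, verbose=0)              # (N, 7) softmax scores
    image_valence = float((probs @ EMOTION_VALENCE).mean())  # average over frames
    return text_weight * text_valence + (1 - text_weight) * image_valence

if __name__ == "__main__":
    cnn = build_face_cnn()  # untrained here; would be trained on FER-2013 in practice
    faces = np.random.rand(4, 48, 48, 1).astype("float32")  # stand-in face crops
    print(fuse_valence("I absolutely love this product!", faces, cnn))
```

In practice the CNN would be trained on a labeled facial-expression corpus (e.g., FER-2013) before fusion, and the audio/video branch would contribute a third valence stream to the same weighted average.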
References
Baccianella, S., Esuli, A., & Sebastiani, F. (2010). SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. Proceedings of LREC, 10, 2200–2204.
Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33, 12449–12460.
Chen, T., Borth, D., Darrell, T., & Chang, S. F. (2014). DeepSentiBank: Visual sentiment concept classification with deep convolutional neural networks. arXiv preprint arXiv:1410.8586.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Goodfellow, I. J., Erhan, D., Carrier, P. L., Courville, A., Mirza, M., Hamner, B., ... & Zhou, Y. (2013). Challenges in representation learning: A report on three machine learning contests. Neural Networks, 64, 59–63.
Hutto, C., & Gilbert, E. (2014). VADER: A parsimonious rule-based model for sentiment analysis of social media text. Proceedings of ICWSM, 8(1), 216–225.
Mollahosseini, A., Hasani, B., & Mahoor, M. H. (2017). AffectNet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing, 10(1), 18–31.
Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using machine learning techniques. Proceedings of EMNLP, 79–86.
Parkhi, O. M., Vedaldi, A., & Zisserman, A. (2015). Deep face recognition. Proceedings of BMVC, 1(3), 6.
Poria, S., Cambria, E., Hazarika, D., Majumder, N., Zadeh, A., & Morency, L. P. (2017). Context-dependent sentiment analysis in user-generated videos. Proceedings of ACL, 873–883.