

AUTOMATIC IMAGE AND VIDEO CAPTION GENERATION USING DEEP LEARNING
Abstract
Deep learning has driven major progress in the automatic generation of captions for images and videos, a long-standing problem in imaging science, with applications in assistive technology, media indexing, and robotic vision. This review surveys key methodologies in image and video captioning, focusing on shared foundations such as CNNs, LSTMs, and GANs, which are used for visual feature extraction and language generation. Video captioning extends image captioning by modeling temporal dependencies across frames. Advances such as attention mechanisms and transformer models have improved the contextual understanding and coherence of generated captions. Together, these techniques show how deep learning can bridge visual content and natural language, opening new avenues in accessibility and intelligent systems.