Similarity Microservices with Automated Tag Generation
Abstract
A scalable web-based application is presented that identifies contextually similar text posts and automatically generates relevant tags using Natural Language Processing (NLP) techniques, built on a modular microservices architecture. The system lets users create and manage text-based posts. Each submission triggers two backend services: the Similarity Engine and the Tag Generator. The Similarity Engine converts the post content into a sentence embedding using models such as Sentence Transformers, then compares the new post with every existing post using cosine similarity. A similarity threshold set between 75% and 85% is applied. If the similarity score exceeds this threshold, the matched posts are recommended to the user; otherwise, the new post is added. Because the comparison operates on embeddings rather than surface words, the application can recommend semantically related posts even when they use entirely different wording.
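The thresholding step described above can be sketched as follows. This is a minimal illustration and not the system's actual code: it assumes each post has already been converted to an embedding vector by a sentence-embedding model, and the names `cosine_similarity`, `find_similar_posts`, and the default threshold of 0.80 (a value inside the stated 75% to 85% range) are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector carries no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def find_similar_posts(
    new_embedding: Sequence[float],
    existing: Dict[str, Sequence[float]],
    threshold: float = 0.80,  # assumed value within the 75%-85% range
) -> List[Tuple[str, float]]:
    """Return (post_id, score) pairs whose score exceeds the threshold, best first."""
    scored = (
        (post_id, cosine_similarity(new_embedding, embedding))
        for post_id, embedding in existing.items()
    )
    matches = [(post_id, score) for post_id, score in scored if score > threshold]
    return sorted(matches, key=lambda match: match[1], reverse=True)


if __name__ == "__main__":
    # Toy 3-dimensional vectors stand in for real sentence embeddings.
    stored = {"a": [1.0, 0.0, 0.0], "b": [0.0, 1.0, 0.0], "c": [0.9, 0.1, 0.0]}
    print(find_similar_posts([1.0, 0.0, 0.0], stored))
```

An empty result corresponds to the "otherwise" branch in the text: no existing post is close enough, so the new post is simply stored. Comparing against every stored post is linear in the number of posts, which matches the description above but may need an approximate nearest-neighbour index at larger scale.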
In parallel, the Tag Generator microservice uses the Rapid Automatic Keyword Extraction (RAKE) algorithm to extract the most important keywords from the post content. These keywords serve as tags for categorization, filtering, and search optimization. The process is fully automated and does not rely on manually assigned labels. The approach combines semantic text processing with modern software engineering practices, making it adaptable to applications such as search engines, recommendation systems, e-learning platforms, and customer support tools.
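To make the tag extraction step concrete, the following is a simplified sketch of RAKE-style scoring, not the service's actual implementation. It assumes a small illustrative stopword list and the hypothetical function names `candidate_phrases` and `extract_tags`; a production service would more likely rely on an established RAKE library and a full stopword list.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Small illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "this", "to", "was", "with",
}


def candidate_phrases(text: str) -> List[List[str]]:
    """Split text into runs of content words, breaking at punctuation and stopwords."""
    phrases: List[List[str]] = []
    for fragment in re.split(r"[^a-z0-9\s]+", text.lower()):
        current: List[str] = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def extract_tags(text: str, top_n: int = 5) -> List[str]:
    """Rank candidate phrases by the sum of their word scores, degree / frequency."""
    phrases = candidate_phrases(text)
    frequency: Dict[str, int] = defaultdict(int)
    degree: Dict[str, int] = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)  # co-occurrences in this phrase, including itself
    phrase_scores: Dict[str, float] = {}
    for phrase in phrases:
        score = sum(degree[word] / frequency[word] for word in phrase)
        phrase_scores[" ".join(phrase)] = score
    ranked = sorted(phrase_scores, key=lambda phrase: phrase_scores[phrase], reverse=True)
    return ranked[:top_n]


if __name__ == "__main__":
    post = "The tag generator extracts keywords from the post. The tag generator is automated."
    print(extract_tags(post, top_n=2))
```

Because the score favours words that appear in long phrases, RAKE tends to rank multi-word phrases above single words, so a deployment may want to cap phrase length before using the output as tags.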