
Improving the Precision of Tesseract-OCR Engine

Aaransh Jain

Abstract


Document images scanned or captured with digital cameras on mobile phones suffer from a number of limitations, such as geometric distortion, loss of focus, uneven lighting conditions, and low scanning resolution. Because of these limitations, the quality of document images is often degraded, which in turn reduces the recognition accuracy of OCR engines. This work focuses on improving the recognition accuracy of the Tesseract-OCR engine for Nepali document images by means of preprocessing. For this purpose, we developed an image preprocessing pipeline consisting of 8 stages and tested it on several Nepali text images collected from different sources, such as a Nepali news corpus, books, and printed documents. Our experimental results showed that recognition accuracy improved from 90.69%, 54.34%, and 38.45% to 94.84%, 71.15%, and 51.21% for high-, medium-, and low-quality images, respectively.
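To make the reported improvements easier to compare, the short Python sketch below computes the gain for each quality class from the figures in the abstract. The absolute gain is a plain difference in percentage points. The relative error reduction is an extra derived measure that the abstract does not report; it assumes the error rate is simply 100% minus the stated accuracy. The names `reported` and `gains` are illustrative only.

```python
# Accuracy figures (percent) exactly as stated in the abstract:
# quality class -> (before preprocessing, after preprocessing).
reported = {
    "high": (90.69, 94.84),
    "medium": (54.34, 71.15),
    "low": (38.45, 51.21),
}


def gains(before, after):
    """Return (absolute gain in percentage points, relative error reduction in percent).

    Assumption: error rate = 100 - accuracy. The abstract does not define
    the accuracy metric, so the second value is only indicative.
    """
    absolute = after - before
    error_reduction = 100.0 * absolute / (100.0 - before)
    return round(absolute, 2), round(error_reduction, 1)


summary = {quality: gains(b, a) for quality, (b, a) in reported.items()}

for quality, (absolute, reduction) in summary.items():
    print(f"{quality}: +{absolute} points, {reduction}% fewer errors")
```

Under that assumption, the largest absolute gain is on medium-quality images, while the largest share of errors removed is on high-quality images, where there were fewer errors to begin with.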




