Intelligent Deep Learning Empowered Text Detection Model from Natural Scene Images

S. Kiruthika Devi, Subalalitha CN


Scene text recognition has become an active research topic and remains a challenging task owing to complicated backgrounds and varying light intensities, colors, font styles, and sizes. Text extraction from natural scene images encompasses two main processes: text detection and text recognition. Recent advances in Machine Learning (ML) and Deep Learning (DL) can effectively automate text detection and recognition when the model is trained properly. In this view, this paper presents an Automated DL empowered Text Detection model from Natural Scene Images (ADLTD-NSI). The ADLTD-NSI technique includes two important processes: text detection and text recognition. First, a single shot detector (SSD) with Inception-v2 as the baseline model is employed for text detection; SSD is an object detector based on the VGG-16 framework for feature map extraction, followed by six convolution layers. Second, a Convolutional Recurrent Neural Network (CRNN) is utilized for the text recognition process. The recurrent layers in the CRNN model use long short-term memory (LSTM) to encode the sequence of feature vectors. Lastly, Connectionist Temporal Classification (CTC) loss is applied to predict the text labels corresponding to the output sequences of the recurrent layers. A wide range of experiments was carried out on benchmark COCO datasets, and the results are examined in several aspects. The experimental outcomes show that the ADLTD-NSI technique outperforms the other compared methods, with a maximum accuracy of 96.78%.
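As an illustration of how CTC-style outputs from recurrent layers map to text labels, the following is a minimal sketch of greedy (best-path) CTC decoding. It is not the authors' implementation: the function name `ctc_greedy_decode`, the choice of index 0 for the blank symbol, and the toy alphabet are all assumptions made for the example, and the abstract does not state which decoding strategy the paper uses.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol (hypothetical convention)


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]], alphabet: str) -> str:
    """Greedy CTC decoding: take the best label per frame, merge repeats, drop blanks.

    frame_scores[t][k] is the score of label k at time step t.
    Label k > 0 maps to alphabet[k - 1]; label 0 is the blank.
    """
    # Best label at every time step.
    best_path: List[int] = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]

    decoded: List[str] = []
    previous = None
    for label in best_path:
        # Emit a character only when the label changes and is not the blank.
        if label != previous and label != BLANK:
            decoded.append(alphabet[label - 1])
        previous = label
    return "".join(decoded)


# Toy per-frame scores over the labels [blank, 'a', 'c', 't'].
frames = [
    [0.10, 0.20, 0.60, 0.10],  # 'c'
    [0.10, 0.10, 0.70, 0.10],  # 'c' again, merged with the previous frame
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.70, 0.10, 0.10],  # 'a'
    [0.90, 0.04, 0.03, 0.03],  # blank
    [0.10, 0.10, 0.10, 0.70],  # 't'
]
print(ctc_greedy_decode(frames, "act"))  # prints "cat"
```

The blank symbol is what lets a decoder of this kind keep genuinely doubled characters: two identical labels separated by a blank are emitted twice, while two adjacent identical labels are merged into one.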


Keywords: Deep learning; natural scene images; text detection; text recognition; COCO dataset; CRNN model; CTC loss.








Published by INSIGHT - Indonesian Society for Knowledge and Human Development