A Hybrid Approach to Typo Correction in Indonesian Documents Using Levenshtein Distance

Authors

  • Joseph Teguh Santoso, University of Science and Computer Technology
  • Song Yan, Nanjing University of Information Science and Technology, Nanjing 210044, China

DOI:

https://doi.org/10.51903/jtie.v3i2.184

Keywords:

Typo Correction, Levenshtein Distance, Empirical Methods, Natural Language Processing, Indonesian Language

Abstract

This study develops a typo correction system for Indonesian that combines the Levenshtein Distance algorithm with empirical methods. The system targets more accurate typo detection and correction in Indonesian text, whose morphology is complicated by prefixes, suffixes, and compound words. The system achieved a precision of 92% and an F1-score of 90.5%, indicating that its correction suggestions are reliably relevant. Processing was also efficient, averaging 0.8 seconds for short texts and 5.3 seconds for longer texts. The empirical methods let the system handle complex language variation and produce more contextually appropriate suggestions. User feedback indicated high satisfaction with both the interface and the relevance of the suggestions. Overall, the research contributes a more adaptive and efficient approach to typo correction for Indonesian and opens opportunities for extension to similar languages.
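The abstract does not describe the implementation, so the following is only a minimal sketch of the general technique it names: computing Levenshtein Distance and using it to rank correction candidates from a word list. The `suggest` helper, the `LEXICON` sample, and the `max_distance` threshold are hypothetical and are not taken from the paper; the empirical (corpus-based) component of the hybrid system is not modeled here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions, substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def suggest(word: str, lexicon: list[str], max_distance: int = 2, limit: int = 3) -> list[str]:
    """Hypothetical helper: return the closest lexicon entries within max_distance."""
    scored = [(levenshtein(word, candidate), candidate) for candidate in lexicon]
    # Sort by distance first, then alphabetically to break ties deterministically.
    close = sorted(pair for pair in scored if pair[0] <= max_distance)
    return [candidate for _, candidate in close[:limit]]


# Illustrative word list only; a real system would use a full Indonesian lexicon.
LEXICON = ["makan", "makanan", "memakan", "dimakan", "minum", "minuman"]

if __name__ == "__main__":
    print(suggest("makn", LEXICON))
```

A plain distance threshold treats affixed forms such as "memakan" and "dimakan" as unrelated strings, which is one reason a system for Indonesian would plausibly pair the distance with additional, morphology-aware or frequency-based evidence.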

Published

2024-08-21

How to Cite

Santoso, J. T., & Yan, S. (2024). A Hybrid Approach to Typo Correction in Indonesian Documents Using Levenshtein Distance. Journal of Technology Informatics and Engineering, 3(2), 151–168. https://doi.org/10.51903/jtie.v3i2.184