Systematic review of spell-checkers for highly inflectional languages

  • Published: 14 November 2019
  • Volume 53, pages 4051–4092 (2020)


  • Shashank Singh
  • Shailendra Singh


The performance of any word processor, search engine, or social media platform relies heavily on language tools such as spell-checkers and grammar checkers. Spell-checkers break text down into words and check them for spelling errors, cautioning the user whenever an unintentional misspelling occurs. The field of spell-checking still lacks an exhaustive study covering aspects such as strengths, limitations, handled error types, and performance, along with the evaluation parameters used. Spell-checkers for many languages are described in the literature, and while they share similar characteristics, each has a different design. This study follows the guidelines of the systematic literature review and applies them to the field of spell-checking. The steps of the systematic review are applied to 130 selected articles published in leading journals, premier conferences, and workshops on spell-checking for different inflectional languages. These steps include framing the research questions, selecting research articles, defining inclusion/exclusion criteria, and extracting the relevant information from the selected articles. The literature on spell-checking is divided into key sub-areas by language, and each sub-area is then described according to the technique used. Various articles are analyzed against specific criteria to reach the conclusions. This article also suggests how techniques from related tasks such as morphological analysis, part-of-speech tagging, chunking, stemming, and hash tables can be used in the development of spell-checkers, and it highlights the major challenges faced by researchers along with future areas of research in the field of spell-checking.
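
A minimal sketch of that detection step (not any specific system from the review): tokenize the text and flag tokens missing from a lexicon. The lexicon and sample sentence below are invented for illustration.

```python
import re

# Toy lexicon; real spell-checkers use full dictionaries or morphological analyzers.
LEXICON = {"performance", "of", "any", "word", "processor", "relies",
           "on", "the", "spell", "checker"}

def non_word_errors(text: str) -> list[str]:
    """Tokenize and return tokens absent from the lexicon (candidate non-word errors)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in LEXICON]

print(non_word_errors("Performance of any wrod processor relies on the spell checker"))
# -> ['wrod']
```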


Abbreviations

  • Finite state machine
  • Dictionary lookup method
  • Morphological analysis
  • Edit distance
  • Minimum edit distance
  • Unicode splitting
  • Character-based long short-term memory
  • Soundex method
  • Levenshtein edit distance
  • Confusion set
  • Reverse minimum edit distance
  • Direct dictionary lookup method
  • Edit distance method
  • Phonetic encoding method
  • Finite state representation
  • State table method
  • Finite state automata
  • Partition around medoid clustering
  • Double metaphone encoding
  • Word frequency
  • Sound and shape similarity
  • Reverse edit distance method
  • Tree-based algorithm
  • Parts of speech
  • Hidden Markov model
  • Graphical user interface
  • Finite state transition
  • Unknown word handling
  • Unknown proper noun handling
  • Application programming interface
  • Constituent word
  • Memory-based language model
  • Finite state transition model
  • Dictionary approach
  • Canti check
  • Crowdsourcing
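
Several of the techniques listed above (edit distance, minimum edit distance, Levenshtein edit distance) rest on the same dynamic-programming recurrence. A minimal Python sketch, independent of any particular system surveyed:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))                 # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                                  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

assert levenshtein("kitten", "sitting") == 3
```

A dictionary-based checker typically ranks lexicon entries by this distance from a misspelled token and proposes the closest ones as corrections.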


Acknowledgements

The authors thank the reviewers for their insightful comments. The authors would also like to thank the Ministry of Electronics and IT, Government of India, for providing a fellowship under Grant Number PhD-MLA-4 (69)/2015-16 (Visvesvaraya PhD Scheme for Electronics and IT) to pursue the Ph.D. work.

Author information

Authors and affiliations

Department of Computer Science and Engineering, Punjab Engineering College (Deemed to be University), Chandigarh, India

Shashank Singh & Shailendra Singh


Corresponding author

Correspondence to Shashank Singh.


About this article

Singh, S., Singh, S. Systematic review of spell-checkers for highly inflectional languages. Artif Intell Rev 53, 4051–4092 (2020). https://doi.org/10.1007/s10462-019-09787-4


  • Spell-check
  • Non-word errors
  • Real-word errors
  • Dictionary lookup
  • Edit-distance
  • Recurrent neural network (RNN)





Spell Check: Recently Published Documents


An Improved Text Extraction Approach with Auto Encoder for Creating Your Own Audiobook

Listening can make learning easier and more engaging than reading. An audiobook converts text to speech, but the audiobooks available on the market are neither free nor affordable for everyone, and they are generally produced only for fictional stories, novels, or comics. A comprehensive review of the available literature shows that very little intensive work has been done on image-to-speech conversion. In this paper, we employ several strategies across the entire process. As an initial step, deep learning techniques are used to denoise the images fed to the system. This is followed by text extraction with the help of OCR engines, with additional improvements to the quality of the extracted text, and a post-processing spell-check mechanism is incorporated. Our result analysis demonstrates that with denoising and spell checking, our model achieved an accuracy of 98.11%, compared to 84.02% without any denoising or spell-check mechanism.

Building Spell-Check Dictionary for Low-Resource Language by Comparing Word Usage

AI: Natural Language Processing (NLP)

Natural Language Processing (NLP) is a branch of Artificial Intelligence (AI) that allows machines to understand human language. Its goal is to build systems that can make sense of text and automatically perform tasks like translation, spell check, or topic classification. NLP has recently gained much attention for representing and analysing human language computationally, and its applications span fields such as computational linguistics, email spam detection, information extraction, summarization, medicine, and question answering. The goal of NLP is to design and build software that can analyze, understand, and generate the languages humans use naturally, so that you can address your computer as if you were addressing another person. As one of the oldest areas of research in machine learning, NLP is employed in major fields such as speech recognition and text processing, and it has brought major breakthroughs in computation and AI.

Domain-shift Conditioning using Adaptable Filtering via Hierarchical Embeddings for Robust Chinese Spell Check

An XML Infrastructure

Spell checking has both practical and theoretical significance. The practical connections seem obvious: spell checking makes it easier to find some kinds of errors in documents. But spell checking is sometimes harder and less capable in XML than it could be. If a spell checker could exploit markup instead of just ignoring it, could spell checking be easier and more useful? The theoretical foundations of spell checking may be less obvious, but every spell checker operationalizes both a simple model of language and a model of errors and error correction. The SCX (spell checking for XML) framework is intended to support the author's experimentation with different models of language and errors: it uses XML technologies to tokenize documents, spell check them, provide a user interface for acting on the flags raised by the spell checker, and insert the corrections into the original text.
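
A minimal sketch of the markup-aware idea above, assuming a toy lexicon: spell-check only the text nodes of an XML document rather than the raw markup. This is an illustration of the general approach, not the SCX framework itself.

```python
import re
import xml.etree.ElementTree as ET

LEXICON = {"spell", "checking", "makes", "it", "easier", "to", "find", "errors"}

doc = ET.fromstring("<p>Spell cheking makes it <b>easier</b> to find erors</p>")
text = " ".join(doc.itertext())   # text content only; tags are skipped
flags = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in LEXICON]
print(flags)                       # -> ['cheking', 'erors']
```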

Data Extraction and Sentimental Analysis from “Twitter” using Web Scrapping.

In this paper, we attempt a sentiment analysis of the 2016 US presidential election. Sentiment analysis requires data to be extracted from websites or sources where people post their opinions, views, and complaints about the subjects to be analyzed. Furthermore, it is necessary to ensure that the sample size of the data is large enough to yield conclusive results, and that the data is cleaned before it is used to make predictions. Cleaning is done using common techniques like tokenization and spell check. Sentiment analysis is one of the by-products of Natural Language Processing. This paper covers data collection as well as classification of textual data based on machine learning.

An Effective Preprocessing Algorithm for Information Retrieval System

The growth of the web has produced a huge amount of information by empowering Internet users to post their assessments, remarks, and reviews online. Preprocessing helps an Information Retrieval (IR) system understand a user query. IR provides the means to represent, search, and access information that relates to a user's search string. The information is expressed in natural language rather than a structured format, and its words are often ambiguous. One of the major challenges in current web search is the vocabulary mismatch problem that arises during preprocessing: the relationships between the query expressions and the expanded terms are limited. Query expansion adds the terms that are most similar to the words of the search string. In this manuscript, we focus on the intent behind a user's search string on the web and identify the best features for term selection in a supervised-learning-based model. The proposed system focuses on preprocessing techniques such as tokenization, stemming, spell check, detecting dissimilar words, and discovering keywords from the user query, which provide better results for the user.
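
As a hedged illustration of such a preprocessing pipeline (tokenization, stemming, and a spell-check pass), the sketch below uses a toy lexicon, a deliberately naive suffix-stripper, and Python's standard difflib; none of these choices comes from the paper itself.

```python
import difflib
import re

LEXICON = {"information", "retrieval", "search", "query", "system"}

def preprocess(raw: str) -> list[str]:
    tokens = re.findall(r"[a-z]+", raw.lower())              # tokenization
    stems = [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]  # naive stemming
    fixed = []
    for t in stems:                                          # spell-check pass
        if t in LEXICON:
            fixed.append(t)
        else:
            close = difflib.get_close_matches(t, LEXICON, n=1)
            fixed.append(close[0] if close else t)
    return fixed

print(preprocess("Informtion retrieval systems"))
# -> ['information', 'retrieval', 'system']
```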

Potentials of Chatbots for Spell Check among Youngsters

Chatbots are already being used successfully in many areas. This publication deals with the development and programming of a chatbot prototype to support learning processes. The chatbot prototype is designed to help pupils correct their spelling mistakes by providing correction proposals. In particular, orthographic spelling mistakes should be recognized by the chatbot and replaced by correction suggestions stored in the test data.

A Study of Students’ Perception on The Use Facebook Group in Improving Writing

This study attempts to understand students' perceptions of a Facebook group for improving writing skills in the tenth grade of SMA 5 Kendari, focusing on students' writing performance and on brainstorming ideas at the pre-writing stage, as well as on how Facebook influences students' affective domain. The researcher employed a mixed-method design with a questionnaire to collect data from 36 students, and the data was analyzed by calculating the frequency distribution on a Likert scale. The results reveal a positive perception among students of using a Facebook group to improve writing. In particular, 89.7% of respondents reported that the spell-check feature helps students avoid spelling errors, and 91.6% felt motivated when they got a "like" from their friends. It is suggested that future research investigate the problems teachers and students face in applying Facebook, explore the use of Facebook groups for other English skills, and examine other social networks such as Path, Twitter, or Instagram. Keywords: Facebook Group, Perception, Writing

A New Approach to Keyboard Inputting Error Prevention and Increasing Inputting Productivity

This article describes the initial concept and development of a new approach to reducing the number of input errors made in working with computers and decreasing the time a typist must spend correcting them. The approach involves intercepting and correcting errors before they are flagged by Spell Check, eliminating the need for the time and effort of Spell Check. The operating principles and concepts of this approach, called Super ErrorCorrect™, are described, along with a software suite that enables the implementation and testing of the approach. This paper reports on the changes and evolution that resulted from analysis and limited beta testing. The preliminary data show that not only is the Super ErrorCorrect™ approach feasible, but substantial time is saved while error rates are reduced markedly. Some data also suggest that, in addition to the time saved by not using Spell Check, users tend to type faster because they no longer receive negative reinforcement for typing quickly and making errors, as the software fixes errors in real time. Researchers are invited to collaborate in further research, and licenses to the technology and software are provided at no cost if the research results will be publicly disclosed.


Accommodations Toolkit

Spell Check: Research


National Center on Educational Outcomes (NCEO)

This fact sheet on spell check is part of the Accommodations Toolkit published by the National Center on Educational Outcomes (NCEO). It summarizes information and research findings on spell check as an accommodation. This toolkit also contains a summary of states' accessibility policies for spell check.


What is spell check? Spell check is a software feature that identifies possible misspellings and either autocorrects them or suggests possible corrections (Cullen et al., 2008; MacArthur, 1999). It is sometimes referred to as a spell checker, spelling checker, or spelling assistance. Spell check can help students correct spelling errors while spending less time on the writing mechanics of spelling, which allows them to concentrate more broadly on developing ideas or content in the writing process (MacArthur, 1999).
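
A minimal sketch of the "suggests possible corrections" behavior described above, using Python's standard difflib against an invented word list:

```python
import difflib

WORDS = ["because", "receive", "separate", "definitely", "necessary"]

def suggest(misspelling: str, n: int = 3) -> list[str]:
    """Return up to n dictionary words most similar to the misspelling."""
    return difflib.get_close_matches(misspelling.lower(), WORDS, n=n)

print(suggest("recieve"))   # -> ['receive']
print(suggest("seperate"))  # -> ['separate']
```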

What are the research findings on who should use this accommodation? Spell check has been used for students with various disabilities in the elementary grades (Finch & Finch, 2013) and secondary grades (Finizio, 2008; Koretz & Hamilton, 2001). According to research findings, most of the students who receive this accommodation have specific learning disabilities (SLD) (Finizio, 2008; Koretz & Hamilton, 2001).

What are the research findings on implementation of spell check? No studies were identified on the implementation of spell check. Three studies examined the frequency of spell check.

  • Two studies examined how frequently students received the spell check accommodation, and both found that spell check was one of the least frequently assigned accommodations at the elementary (Finch & Finch, 2013) and secondary (Koretz & Hamilton, 2001) levels.
  • Finizio (2008) examined the match between instructional accommodations and state assessment accommodations documented in the individualized education programs (IEPs) of secondary students with various disabilities, most of whom had SLD. The results indicated that spell check was mostly used as an instructional accommodation and not generally used on assessments.

What perceptions do students and teachers have about spell check? No studies were found that examined student or teacher perceptions of spell check as an assessment accommodation.

What have we learned overall? Research studies found that spell check is one of the least frequently assigned assessment accommodations, though it may be used more often during instruction. It is used for elementary and secondary students with various disabilities and is most frequently provided to students with SLD. No studies were identified that examined the effect of spell check on student performance. Research is needed on the effect of spell check on the performance of students with different disabilities, including English learners with disabilities. Likewise, there is a need to explore teacher and student perceptions of the spell check accommodation.

  • Cullen, J., Richards, S., & Frank, C. L. (2008). Using software to enhance the writing skills of students with special needs. Journal of Special Education Technology, 23(2), 33–44. https://doi.org/10.1177/016264340802300203
  • Finch, W. H., & Finch, M. E. H. (2013). Differential item functioning analysis using a multilevel Rasch mixture model: Investigating the impact of disability status and receipt of testing accommodations. Journal of Applied Measurement, 15(2), 133–151. http://jampress.org/
  • Finizio, N. J., II. (2008). The relationship between instructional and assessment accommodations on student IEPs in a single urban school district (Publication No. 3313763) [Doctoral dissertation, University of Massachusetts Boston]. ProQuest Dissertations and Theses Global.
  • Koretz, D., & Hamilton, L. (2001). The performance of students with disabilities on New York's Revised Regents Comprehensive Examination in English (CSE Technical Report No. 540). Center for the Study of Evaluation (CRESST), UCLA. http://www.rand.org/content/dam/rand/pubs/drafts/2008/DRU2608.pdf
  • MacArthur, C. A. (1999). Word prediction for students with severe spelling problems. Learning Disability Quarterly, 22(3), 158–172. https://doi.org/10.2307/1511283

Attribution

All rights reserved. Any or all portions of this document may be reproduced and distributed without prior permission, provided the source is cited as:

  • Goldstone, L., Lazarus, S. S., Hendrickson, K., Rogers, C., & Hinkle, A. R. (2022). Spell check: Research (NCEO Accommodations Toolkit #27a). National Center on Educational Outcomes.

The Center is supported through a Cooperative Agreement (#H326G210002) with the Research to Practice Division, Office of Special Education Programs, U.S. Department of Education. The Center is affiliated with the Institute on Community Integration at the College of Education and Human Development, University of Minnesota. Consistent with EDGAR §75.62, the contents of this report were developed under the Cooperative Agreement from the U.S. Department of Education, but do not necessarily represent the policy or opinions of the U.S. Department of Education or Offices within it. Readers should not assume endorsement by the federal government. Project Officer: David Egnor


Free Essay and Paper Checker


Correct your entire essay within 5 minutes

  • Proofread on 100+ language issues
  • Specialized in academic texts
  • Corrections directly in your essay

Why this is the best free essay checker


Tested most accurate

In the test for the best grammar checker, Scribbr found 19 out of 20 errors.

No signup needed

You don’t have to register or sign up. Insert your text and get started right away.

Unlimited words and characters

Long texts, short texts, it doesn't matter: there's no character or word limit.

The Grammar Checker is Ad-Free

Don’t wait for ads or distractions. The essay checker is ad-free!

Punctuation checker

Nobody's perfect all the time—and now, you don’t have to be!

There are times when you just want to write without worrying about every grammar or spelling convention. The online proofreader immediately finds all of your errors. This allows you to concentrate on the bigger picture. You’ll be 100% confident that your writing won’t affect your grade.


Correcting your grammar

The Scribbr essay checker fixes grammar mistakes like:

  • Sentence fragments & run-on sentences
  • Subject-verb agreement errors
  • Issues with parallelism


Spelling & Typos

Basic spell-checks often miss academic terms in writing and mark them as errors. Scribbr has a large dictionary of recognized (academic) words, so you can feel confident every word is 100% correct.

Punctuation errors

The essay checker takes away all your punctuation worries. Avoid common mistakes with:

  • Dashes and hyphens
  • Apostrophes
  • Parentheses
  • Question marks
  • Colons and semicolons
  • Quotation marks


Avoid word choice errors

Should you use "affect" or "effect"? Is it "then" or "than"? Did you mean "there," "their," or "they're"?

Never worry about embarrassing word choice errors again. Our grammar checker will spot and correct any errors with commonly confused words.


Improve your text with one click

The Scribbr Grammar Checker allows you to accept all suggestions in your document with a single click.

Give it a try!


Correct your entire document in 5 minutes

Would you like to upload your entire essay and check it for 100+ academic language issues? Then Scribbr’s AI-powered proofreading is perfect for you.

With the AI Proofreader, you can correct your text in no time:

  • Upload document
  • Wait briefly while all errors are corrected directly in your document
  • Correct errors with one click



A Grammar Checker for all English variants

There are important differences between the versions of English used in different parts of the world, including UK and US English. Our essay checker supports a variety of major English dialects:

  • Canadian English
  • Australian English

Why users love our Essay Checker

  • 🌐 Language variants: English US, UK, CA, & AU
  • 🏆 Quality: outperforms the competition
  • ✍️ Improves: grammar, spelling, & punctuation
  • ⭐️ Rating: based on 13,500 reviews

Save time and upload your entire essay to fix it in minutes

Scribbr & academic integrity

Scribbr is committed to protecting academic integrity. Our plagiarism checker, AI Detector, Citation Generator, proofreading services, paraphrasing tool, grammar checker, summarizer, and free Knowledge Base content are designed to help students produce quality academic papers.

We make every effort to prevent our software from being used for fraudulent or manipulative purposes.


Frequently asked questions

Our Essay Checker can detect most grammar, spelling, and punctuation mistakes. That said, we can’t guarantee 100% accuracy. 

Absolutely! The Essay Checker is particularly useful for non-native English speakers, as it can detect mistakes that may have gone unnoticed.

The exact time depends on the length of your document, but in most cases it doesn't take more than a minute.

  • Corpus ID: 17858148

Spell Checker for OCR

  • Yogomaya Mohapatra, A. Mishra, A. Mishra
  • Published 2013
  • Computer Science




A Research on Online Grammar Checker System Based on Neural Network Model

  • November 2020
  • Journal of Physics: Conference Series 1651(1):012135

Abstract and Figures

Figure: The flowchart of our grammar checker.

  • Open access
  • Published: 05 August 2024

Improving the quality of Persian clinical text with a novel spelling correction system

  • Seyed Mohammad Sadegh Dashti
  • Seyedeh Fatemeh Dashti

BMC Medical Informatics and Decision Making, volume 24, article number 220 (2024)

Background

The accuracy of spelling in Electronic Health Records (EHRs) is a critical factor for efficient clinical care, research, and ensuring patient safety. The Persian language, with its abundant vocabulary and complex characteristics, poses unique challenges for real-word error correction. This research aimed to develop an innovative approach for detecting and correcting spelling errors in Persian clinical text.

Methods

Our strategy employs a state-of-the-art pre-trained model that has been meticulously fine-tuned specifically for the task of spelling correction in the Persian clinical domain. This model is complemented by an innovative orthographic similarity matching algorithm, PERTO, which uses visual similarity of characters for ranking correction candidates.

Results

The evaluation of our approach demonstrated its robustness and precision in detecting and rectifying word errors in Persian clinical text. In terms of non-word error correction, our model achieved an F1-Score of 90.0% when the PERTO algorithm was employed. For real-word error detection, our model demonstrated its highest performance, achieving an F1-Score of 90.6%. Furthermore, the model reached its highest F1-Score of 91.5% for real-word error correction when the PERTO algorithm was employed.

Conclusions

Despite certain limitations, our method represents a substantial advancement in the field of spelling error detection and correction for Persian clinical text. By effectively addressing the unique challenges posed by the Persian language, our approach paves the way for more accurate and efficient clinical documentation, contributing to improved patient care and safety. Future research could explore its use in other areas of the Persian medical domain, enhancing its impact and utility.


Introduction

Spelling correction is a vital task in all text processing environments, and its importance is amplified for languages with intricate morphology and syntax, such as Persian. This significance is further heightened in the realm of clinical text, where precise documentation is a cornerstone of effective patient care, research, and patient safety. The written text of medical findings remains the essential source of information for clinical decision making. Clinicians prefer to write unstructured text rather than filling out structured forms when they document progress notes, due to time and efficiency constraints [1]. The quality and safety of health care depend on the accuracy of clinical documentation [2]. However, misspellings often occur in clinical texts because they are written under time pressure [3].

The process of spelling correction primarily tackles two types of errors: non-word errors, which are nonsensical strings not found in a dictionary, and real-word errors, which are correctly spelled words used inappropriately in context. These errors can stem from various sources, including typographical mistakes, confusion between similar-sounding or similar-meaning words [4], incorrect replacements by automated systems like AutoCorrect features [5], and misinterpretation of input by ASR and OCR systems [6, 7, 8, 9].
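
To make the two error classes concrete, here is an illustrative sketch (the dictionary and bigram table are invented, not the paper's model): a non-word error fails dictionary lookup, while a real-word error passes lookup but is implausible in context.

```python
DICTIONARY = {"the", "patient", "was", "given", "medicine", "medical"}
LIKELY_BIGRAMS = {("the", "patient"), ("patient", "was"),
                  ("was", "given"), ("given", "medicine")}

def classify_errors(tokens: list[str]) -> list[tuple[str, str]]:
    errors = []
    for i, tok in enumerate(tokens):
        if tok not in DICTIONARY:
            errors.append((tok, "non-word error"))          # fails lookup
        elif i > 0 and (tokens[i - 1], tok) not in LIKELY_BIGRAMS:
            errors.append((tok, "possible real-word error"))  # valid word, odd context
    return errors

print(classify_errors("the patient was given medicin".split()))  # non-word
print(classify_errors("the patient was given medical".split()))  # real-word
```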

The Persian language, with its rich vocabulary and complex properties, presents unique challenges for real-word error correction. Features unique to Persian such as homophony (words that are pronounced identically yet carry distinct meanings), polysemy (words with multiple meanings), heterography (words that share identical spelling but their meanings vary based on how they are pronounced), and word boundary issues contribute to this complexity.

Despite these challenges, numerous efforts have been made to develop both statistical and rule-based approaches for identifying and rectifying both classes of errors in the general Persian text domain; however, work in the Persian medical domain, and specifically on Persian clinical text, is very limited, and these methods have attained only limited success. In this study, we introduce an innovative method to detect and correct word errors in Persian clinical text, aiming to significantly improve the accuracy and reliability of healthcare documentation. Our key contributions include:

Language Representation Model: We showcase a pre-trained language representation model that has undergone meticulous fine-tuning, specifically for the task of spelling correction in the Persian clinical domain.

PERTO Algorithm: We introduce an innovative orthographic similarity matching algorithm that leverages the visual resemblance of characters to prioritize correction candidates.

We utilize the F1-score metric to evaluate and contrast our methodology with established approaches for detecting and rectifying both non-word and real-word errors within the context of Persian clinical text.

The rest of this paper is structured as follows: We commence with a review of prior research in the field. Following this, we delve into the challenges faced in Persian language text processing. Subsequently, we outline our proposed approach. Evaluation and experiment results are then presented and discussed. In the final segment, we summarize our findings.

Related works

Automatic word error correction is a crucial component in NLP systems, particularly in the context of EHR and clinical reports. Early techniques were based on edit distance and phonetic algorithms [ 10 , 11 , 12 , 13 ]. The incorporation of context information has been demonstrated to be effective in boosting the efficiency of auto-correction systems [ 14 ]. Contextual measures like semantic distance and noisy channel models based on N-grams have been employed across numerous NLP applications [ 4 , 5 , 15 , 16 , 17 ]. A novel approach was also developed to correct multiple context-sensitive errors in excessively noisy situations [ 18 ]. Dashti developed a model that addressed the identification and automatic correction of context-sensitive errors in cases where more than one error existed in a given word sequence [ 19 ].

Cutting-edge methods in NLP systems utilize context information through neural word or sense embeddings for spelling correction [ 20 ]. Pretrained contextual embeddings have been used to detect and rectify context-sensitive errors [ 21 ]. The issue of spelling correction has been addressed using deep learning techniques for various languages in recent years. For example, a study in 2020 proposed a deep learning method to correct context-sensitive spelling errors in English documents [ 22 ]. Another work developed a BERT-Based model for the same purpose [ 23 ]. NeuSpell is a user-friendly neural spelling correction toolkit that offers a variety of pre-trained models [ 24 ]. SpellBERT is a lightweight pre-trained model for Chinese spelling check [ 25 ]. A disentangled phonetic representation approach for Chinese spelling correction was proposed [ 26 ]. Other approaches for Chinese spelling correction utilized phonetic pre-training [ 27 ]. An innovative approach was devised specifically for the purpose of contextual spelling correction within comprehensive speech recognition systems [ 28 ]. A dual-function framework for detecting and correcting spelling errors in Chinese was proposed [ 29 ]. Liu and colleagues proposed a method, known as CRASpell, which is resilient to contextual typos and has been developed to enhance the process of correcting spelling errors in Chinese [ 30 ]. AraSpell is an Arabic spelling correction approach that utilized a Transformer model to understand the connections between words and their typographical errors in Arabic [ 31 ].

In the realm of healthcare, the application of spelling correction techniques has been instrumental in expanding acronyms, abbreviations, and truncations, and in rectifying misspellings. It has been observed that such instances constitute up to 30% of clinical content [ 32 ]. In the last twenty years, a significant amount of research has been conducted on spelling correction methods specifically designed for clinical texts [ 1 ]. The majority of these studies have primarily focused on EHR [ 33 ], while a few have explored consumer-generated texts in healthcare [ 34 , 35 ].

Several noteworthy contributions exist in this field. The French clinical record spell checker introduced by Ruch and colleagues achieves a correction rate of up to 95% [ 36 ]. Siklósi and associates devised a context-aware system for Hungarian clinical text, grounded in statistical machine translation, which attained an accuracy of 87.23% [ 37 ]. Grigonyte and her research team introduced a system tailored for Swedish clinical text, achieving a precision of 83.9% and a recall of 76.2% [ 38 ].

Zhou and colleagues leveraged the Google spell checker to develop a system capable of accurately correcting 86% of typographical and linguistic inaccuracies found in routine medical terminologies [ 35 ]. Another study described a spelling correction system for vaccine safety reports, with recall and precision rates of 74% and 47%, respectively [ 39 ]. Wong and his team designed a real-time system for rectifying spelling errors in clinical reports, achieving an accuracy of 88.73%; it leverages semantic and statistical analysis of web data for automatic correction [ 1 ]. Doan and his research team presented a system specifically designed for the rectification of misspelled drug names. This system, based on the Aspell algorithm, reported a precision rate of 80% [ 40 ].

Among the recent contributions is an article by Lai and colleagues proposing a system for automatic spelling correction in medical texts, employing a noisy channel model to achieve significant accuracy [ 41 ]. Similarly, unsupervised, context-aware models have shown promise in correcting spelling errors in English and Dutch clinical unstructured texts [ 42 , 43 ].

While these advancements have significantly improved spelling correction across languages and domains, recent innovations in BCIs, eye-tracking, VR/AR, and non-invasive EEG technologies open new avenues for further enhancing human–computer interaction and the accuracy of medical documentation [ 44 , 45 , 46 , 47 ]. These technologies, through their unique capabilities to interact directly with the user's cognitive states and attention, offer potential solutions to some of the inherent limitations of current NLP systems in understanding and correcting complex, context-sensitive errors in clinical texts. As the field continues to evolve, integrating these cutting-edge technologies into spelling correction tools for medical documentation could revolutionize the way healthcare professionals interact with digital text, making the process more efficient, accurate, and tailored to their specific needs.

In addition, the emergence and application of OCR technology in the healthcare sector over the past twenty years has led to the creation of several systems designed to detect and correct OCR errors automatically. One such system, described in [ 48 ], identifies and rectifies typographical errors in French clinical documents. In a newer study, Tran and colleagues proposed a context-sensitive model for spelling correction in clinical text [ 49 ].

Despite the complexities inherent in the Persian language, substantial progress has been made in the field of spelling correction. The strategies employed range from statistical and rule-based methods to more contemporary systems such as the Vafa spellchecker, which is capable of detecting a wide variety of errors; Mosavi Miangah, for instance, addressed spelling issues in Persian using N-grams, a monolingual corpus, and a string-distance measure [ 50 , 51 , 52 , 53 , 54 , 55 , 56 , 57 ]. Of these methodologies, only one focuses on correcting typographical errors in clinical text, utilizing a four-gram language model. Consequently, the need for a Persian spell-checking tool in specialized domains, such as healthcare, is clear.

Given the variety of methodologies and their targeted applications in spelling correction, we provide Table  1 below to efficiently summarize the key contributions within the medical domain and Persian language spelling correction models. This comparative analysis not only illuminates the range of strategies employed to address spelling correction challenges across diverse languages and contexts but also underlines the distinctive features of each method. In doing so, it enhances our understanding of the current research landscape in this field, spotlighting the innovative approaches and shedding light on the potential avenues for future exploration.

Persian spelling challenges

Persian, alternatively referred to as Farsi, belongs to the Indo-Iranian subgroup of the Indo-European family of languages. It holds official language status in countries such as Iran, Tajikistan, and Afghanistan. Over time, Persian has incorporated elements from other languages such as Arabic, thereby enriching its vocabulary. Despite these influences, the fundamental structure of the language has largely remained intact for centuries [ 55 , 59 ].

While Persian is a vibrant and expressive language, it presents several challenges for language processing:

Character Ambiguity : Persian characters like “ی” and “ي” are often used interchangeably but represent different sounds [ 60 ].

Rich Morphology : New words can be created by adding prefixes and suffixes to a base word, like “دست” (hand) to “دست‌ها” (hands) [ 61 ].

Orthography : Persian involves a combination of spaces and semi-spaces, which can lead to inconsistencies [ 62 ].

Co-articulation : The pronunciation of a consonant like “ب” can be affected by the subsequent vowel [ 63 ].

Dialectal Variation : Persian has several standard varieties such as Farsi, Dari, and Tajik [ 64 ].

Cultural Factors : The phenomenon of Persianization can shape the way Persian is used and interpreted.

Lack of Resources : Often, Persian is classified as a language with limited resources, given the scarcity of accessible data and tools for Natural Language Processing [ 61 ].

Free Word Order : Persian allows for the rearrangement of words within a sentence without significantly altering its meaning [ 65 ].

Homophony : Different words have identical pronunciation but different meanings, like (“گذار” /gʊzɑr/ ‘transition’) Footnote 1 and (“گزار” /gʊzɑr/ ‘predicate’) [ 66 ].

Diacritics : They are frequently left out in writing, leading to ambiguity in word recognition [ 67 ].

Rapidly Changing Vocabulary : Persian’s vocabulary is rapidly evolving due to factors such as technology and globalization [ 68 ].

Lack of standardization : There is no single standard for Persian text, which can complicate the development of language processing models capable of handling a variety of dialects and styles [ 69 ].

A significant issue is the treatment of internal word boundaries, often represented by a zero-width non-joiner space or “pseudo-space”. Ignoring these can lead to text processing errors. Pre-processing steps can help resolve these issues by correcting pseudo and white spaces according to internal word boundaries and addressing tokenization problems.

These challenges highlight the need for robust computational models and resources that can handle the intricacies of the Persian language while ensuring accurate language processing.

Material and methods

Our methodology detects and corrects two categories of mistakes in Persian clinical text: Non-word and Real-word errors. The architecture of the proposed system is depicted in Fig.  1 . The system design is composed of five distinct modules that communicate via a databus.

figure 1

Architecture of the proposed system for detecting and correcting Persian word errors

The INPUT module accepts raw test corpora. The pre-processing component normalizes the text and addresses word boundary issues. The contextual analyzer module assesses the contextual similarity within desired word sequences.

For error detection, we implement a dictionary reference technique to pinpoint non-word errors and use contextual similarity matching to detect real-word errors. The error correction module rectifies both classes of errors using context information from a fine-tuned contextual embeddings model, in conjunction with orthographic and edit-distance similarity measures.

The corrected corpora or word sequence is then delivered through the OUTPUT module.

Pre-processing step

Text pre-processing is a crucial step in numerous NLP applications, which includes the segmentation of sentences, tokenization, normalization, and the removal of stop-words. The segmentation of sentences involves determining the boundaries of a sentence, usually marked by punctuation such as full stops, exclamation marks, or question marks. Tokenization is the process of decomposing a sentence into a set of terms that capture the sentence's meaning and are utilized for feature extraction. Normalization is the procedure of converting text into its standard forms and is particularly important in NLP applications for Persian, as it is for many other languages. A key task in normalizing Persian text is the conversion of pseudo and white spaces into regular forms, replacing whitespaces with zero-width non-joiners when necessary.

For example, (‘می شود’ /mi ʃævæd/ ‘is becoming’) is replaced with (‘می‌شود’ /miʃævæd/ ‘is becoming’), where the space becomes a zero-width non-joiner. Persian and Arabic have numerous similarities, and certain Persian letters are frequently written incorrectly using their Arabic variants. It is often advantageous to normalize these discrepancies by substituting Arabic characters (ي ‘Y’ /j/; ک ‘k’ /k/; ه ‘h’ /h/) with their corresponding Persian forms. For instance, (‘براي’ /bærɑy/ ‘for’) is transformed to (‘برای’ /bærɑy/ ‘for’). Normalization also includes removing diacritics from Persian words; e.g., (‘ذرّه’ /zærre/ ‘particle’) is changed to (‘ذره’ /zære/ ‘particle’). Additionally, kashidas are removed from words; for instance, (‘بــــــاند’ /bɑnd/ ‘band’) is transformed to (‘باند’ /bɑnd/ ‘band’).

In order to accomplish the goal of normalization, a dictionary named Dehkhoda, which includes the correct typographic form of all Persian words, is utilized to determine the standard form of words that have multiple shapes [ 70 ].
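To make these normalization steps concrete, the following minimal Python sketch applies them to the examples above. It is an illustrative toy rather than the system's actual implementation: the character mappings and the pseudo-space rule cover only the cases shown, whereas the real pipeline decides standard word forms via the Dehkhoda dictionary.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner, the Persian "pseudo-space"

# Arabic letter forms replaced by their Persian counterparts (a small subset).
ARABIC_TO_PERSIAN = {
    "\u064a": "\u06cc",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06a9",  # Arabic kaf -> Persian kaf
}

DIACRITICS = re.compile("[\u064b-\u0652]")  # fathatan .. sukun, incl. shadda
KASHIDA = "\u0640"                          # tatweel (stretching character)

def normalize(text: str) -> str:
    for arabic, persian in ARABIC_TO_PERSIAN.items():
        text = text.replace(arabic, persian)
    text = DIACRITICS.sub("", text)          # drop diacritics, e.g. shadda
    text = text.replace(KASHIDA, "")         # drop kashida stretching
    # Toy pseudo-space rule: join the verbal prefix "می" to its host word.
    text = re.sub(r"\bمی ", "می" + ZWNJ, text)
    return text

print(normalize("براي می شود بــــاند"))  # -> "برای می‌شود باند"
```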

Damerau-Levenshtein distance and candidate generation

Our methodology employs the Damerau-Levenshtein distance metric to generate potential rectifications for both non-word and real-word errors [ 11 ]. This measure considers insertion, deletion, substitution, and transposition of characters. For instance, the Damerau-Levenshtein distance between "KC" and "CKE" equals 2. It has been found that around 80% of human-generated spelling errors involve these four error types [ 71 ]. Studies indicate that context-sensitive errors constitute approximately 25% to 40% of all typographical errors in English documents [ 72 , 73 ].
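For reference, the sketch below implements the restricted Damerau-Levenshtein (optimal string alignment) distance over the four operations just listed and reproduces the "KC"/"CKE" example. This is a generic textbook implementation, not the paper's own code.

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein distance: insertion, deletion,
    substitution, and adjacent transposition, via dynamic programming."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]

print(damerau_levenshtein("KC", "CKE"))  # -> 2, matching the example above
```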

Our model utilizes an extensive dictionary to pinpoint misspellings. This dictionary is divided into two segments: general and specialized terms. For the general segment, we employ the Vafa spell-checker dictionary, a highly respected spell checker for the Persian language. This dictionary encompasses 1,095,959 general terms but excludes specialized medical terminology. For the specialized segment, we used our training texts to build a custom dictionary that integrates specialized terminology found in breast ultrasonography, head and neck ultrasonography, and abdominal and pelvic ultrasonography texts. It was further enriched with translations from the Radiological Sciences Dictionary by David J Dowsett to pinpoint misspellings of specialized terms [ 74 ]. This custom dictionary comprises 10,332 specialized terms in the fields of breast ultrasound, head and neck ultrasound, and abdominal and pelvic ultrasound, but it does not encompass general terms.

To avoid duplicating specialized terms, we compared our comprehensive dictionary against the Radiological Sciences Dictionary using custom software developed by the researchers of this study. This ensured that no term appeared more than once in the dictionary, as some terms might be present in both sources.

Upon our analysis of the test data, we concluded that an edit distance of up to 2 between the candidate corrections and error would be ideal. With an edit distance set to one, an average of three candidates are generated as potential replacements for a target context word. However, when the edit distance is increased to 2, the average number of generated candidates rises to 15. Correspondingly, the computation time also increases. We ensure that the generated candidates are validated against the reference lexicon.
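The candidate-generation step can be sketched as a Norvig-style enumeration of all strings within one or two edits, validated against the reference lexicon. The alphabet and three-word lexicon below are placeholders for the full Persian alphabet and the merged general + medical dictionary described above.

```python
ALPHABET = "ابپتثجچحخدذرزژسشصضطظعغفقکگلمنوهی"
LEXICON = {"مایع", "ضایعه", "توده"}  # stand-in for the ~1.1M-entry dictionary

def edits1(word: str) -> set[str]:
    """All strings one edit away: deletions, transpositions,
    substitutions, and insertions."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {L + R[1:] for L, R in splits if R}
    transposes = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
    substitutes = {L + c + R[1:] for L, R in splits if R for c in ALPHABET}
    inserts = {L + c + R for L, R in splits for c in ALPHABET}
    return deletes | transposes | substitutes | inserts

def candidates(word: str, max_distance: int = 2) -> set[str]:
    """Return in-lexicon candidates within the given edit distance."""
    near = edits1(word)
    if max_distance >= 2:
        near |= {e2 for e1 in near for e2 in edits1(e1)}
    return near & LEXICON  # validate against the reference lexicon

print(candidates("مایغ"))  # -> {'مایع'}
```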

Contextual embeddings

Word embeddings, which analyze vast amounts of text data to encapsulate word meanings into low-dimensional vectors [ 75 , 76 ], retain valuable syntactic and semantic information [ 77 ] and are advantageous for numerous NLP applications [ 78 ]. However, they grapple with the issue of meaning conflation deficiency, which is the inability to differentiate between multiple meanings of a word.

To tackle this, cutting-edge approaches represent specific word senses, referred to as contextual embeddings or sense representations. Context-sensitive word embedding techniques such as ELMo consider the context of the input sequence [ 65 ]. There exist two main strategies for pre-training language representation models: feature-oriented methods and fine-tuning methods [ 79 ]. Fine-tuning techniques train a language model on large datasets of unlabeled plain text; the parameters of these models are later fine-tuned using data pertinent to the task at hand [ 79 , 80 , 81 ]. However, pre-training an efficient language model demands substantial data and computational resources [ 82 , 83 , 84 , 85 ]. Multilingual models have been formulated for languages that share morphological and syntactic structures. However, languages that do not use the Latin script deviate significantly from those that do, thereby requiring a language-specific approach [ 86 ]. This challenge also applies to Persian. Although some multilingual models encompass Persian, their performance may not match that of monolingual models, which are trained on a language-specific lexicon with more extensive volumes of Persian text data. As far as we are aware, ParsBERT [ 69 ] and SinaBERT [ 87 ] are the sole efforts to pre-train a Bidirectional Encoder Representations from Transformers (BERT) model explicitly for the Persian language.

Pre-trained language representation model

Persian is often recognized as an under-resourced language. Despite the existence of language models that support Persian, only two, namely ParsBERT [ 69 ] and SinaBERT [ 87 ], have been pre-trained on large Persian corpora. ParsBERT was pre-trained on data from the general domain, which includes a substantial amount of informal documents such as user reviews and comments, many of which contain misspelled words.

Conversely, SinaBERT was pre-trained on unprocessed text from the general medical domain. The data for SinaBERT was compiled from a diverse set of sources: websites providing health and medical news; websites disseminating scientific information about health, nutrition, lifestyle, and related topics; journals (both abstracts and complete papers) and conference proceedings; scholarly written materials; medical reference books and dissertations; online forums centered around health; medical and health-related Instagram pages; and medical channels and groups on Telegram.

The data primarily consisted of general medical domain data, a portion of which was informal and contained misspellings. These factors make these pre-trained models unsuitable for Persian clinical domain spelling correction tasks. The lack of an efficient language model in this domain poses a considerable hurdle. In the subsequent section, we will explore our Persian Clinical Corpus and the procedure of pre-training our language representation model.

While numerous formal general domain Persian medical texts are freely accessible, they may not be ideal for spelling correction in clinical texts. Conversely, Persian clinical texts are not widely available to the public. Nevertheless, the use of Persian clinical text is essential for pre-training a language representation model specifically for spelling correction in Persian clinical text. Consequently, we assembled a substantial collection of Persian Clinical texts to train an effective model for spelling correction in Persian.

Our data comprises a total of 78,643 ultrasonography reports, which were obtained from three distinct datasets. These datasets were generously provided by the Department of Imaging's HIS at Tehran's Imam Khomeini Hospital. For a detailed breakdown of these datasets, please refer to Table  2 .

Each dataset comprised three different types of medical reports: breast ultrasonography, head and neck ultrasonography, and abdominal and pelvic ultrasound reports. The first dataset, spanning from January 2011 to February 2015, included 22,504 reports with a total of 7,538,840 words. The average report length in this dataset was 335 words. The second dataset contained 15,888 reports and 4,782,288 words, encompassing all texts entered by medical typists from March 2015 to July 2018. The average length of sonography reports in this dataset was 301 words. The third dataset, which covers the period from August 2018 to June 2023, comprises 40,251 reports and a total of 14,007,348 words. All of these reports were inputted by medical typists. The average word count for the sonography reports in this dataset is 348 words. Upon analyzing the corpus, we found that 1.2% of the words in the corpora represent instances of errors, which can be classified into two types: non-word errors and real-word errors. Further scrutiny revealed that out of this 1.2% segment, non-word errors constitute 1%, while the remaining 0.2% are real-word errors.

We employed a random selection process to ensure a fair representation of the entire corpora in both the testing and training datasets. Specifically, 10% of the sentences from the corpora, amounting to 188,963 sentences, were randomly chosen for testing and evaluation. The remaining 90% of the sentences, which equates to 1,700,668 sentences, were allocated for the fine-tuning and pre-training of the model. Of these, 10% were used for fine-tuning and the rest, 90%, for pre-training. This process encompassed several steps including normalization, pre-processing, and the removal of punctuation marks, tags, and so forth. In addition, we addressed both real-word and non-word errors present in the training corpus. This meticulous approach ensures the robustness and accuracy of our model.

Model architecture

The structure of our suggested model is founded on the original \({\text{BERT}}_{\text{BASE}}\) configuration, which comprises 12 hidden layers, 12 attention heads, a hidden size of 768, and a total of 110M parameters. Our model is designed to handle a maximum of 512 tokens. The architecture of the model is depicted in Fig.  2 . BERT's success is often attributed to its MLM pre-training task, in which tokens are randomly masked or replaced before the model predicts the original tokens [ 80 ]. This makes BERT particularly suitable for a spelling checker, as it interprets the masked and altered tokens as misspellings. In the embedding layer of BERT, each input token, denoted as \(T_i\), is indexed to its corresponding embedding representation, \(ER_i\). This \(ER_i\) is then forwarded to BERT's encoder layers to obtain the contextual representation \(HR_i\).

figure 2

Architecture of Pre-trained Language Representation Model for Persian Clinical Text Spelling Correction

In this context, both \(ER_i\) and \(HR_i\) belong to the real number space \(R^{1\times d}\), where \(d\) represents the hidden dimension. Subsequently, the similarities between \(HR_i\) and all token embeddings are calculated to predict the distribution \(Y_i\) over the existing vocabulary:

$$Y_i = \mathrm{softmax}\left(HR_i\,E^{\top}\right)$$

where \(E \in R^{V\times d}\) and \(Y_i \in R^{1\times V}\); here \(V\) signifies the size of the vocabulary and \(E\) represents the BERT embedding layer. The \(i\)th row of \(E\) aligns with \(ER_i\) in accordance with Eq.  1 . The ultimate rectification outcome for \(T_i\) is the token \(T_k\) whose embedding \(ER_k\) exhibits the greatest similarity to \(HR_i\).
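This prediction step can be illustrated numerically as follows; the toy NumPy sketch uses random matrices in place of trained weights, with the shapes given above.

```python
import numpy as np

d, V = 768, 30000                     # hidden size and vocabulary size
rng = np.random.default_rng(0)        # random stand-ins for trained weights
HR_i = rng.normal(size=(1, d))        # encoder output for token i
E = rng.normal(size=(V, d))           # embedding matrix, one row per token

logits = HR_i @ E.T                   # similarity of HR_i to every ER_k
Y_i = np.exp(logits - logits.max())   # numerically stable softmax
Y_i /= Y_i.sum()

k = int(Y_i.argmax())                 # token whose ER_k is most similar
print(f"predicted token id: {k}, probability: {Y_i[0, k]:.4f}")
```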

Fine-tuning for spelling correction task

We fine-tuned the pre-trained model specifically for the task of spelling correction in Persian clinical text, aiming to achieve optimal performance. For this fine-tuning process, we utilized 10% of the reserved sentences from the training corpus, amounting to 170,066 sentences. Each input to the model was a single sentence ending with a full stop, as our primary focus was on training the model for spelling correction. Upon examining the test set, we found that many sentences were short, and masking a few tokens would significantly reduce the context. Consequently, we excluded sentences with fewer than 20 words from the corpus. In the end, we selected 122,162 sentences, each with a minimum length of 20 words. However, since the input was a list of sentences that couldn't be directly fed into the model, we tokenized the text. The objective of the error correction task is to predict target or masked words by gaining context from adjacent words. Essentially, the model tries to reconstruct the original sentence from the masked sentence received in the input at the output. Therefore, the target labels are the actual input_ids of the tokenizer.

In the original \({\text{BERT}}_{\text{BASE}}\) model, 15% of the input tokens were masked, with 80% replaced with [mask] tokens, 10% replaced with random tokens, and the remaining 10% left unchanged. However, in our fine-tuning task, we only replaced 15% of the input tokens with [mask], except for special ones; we did not use [mask] tokens to replace [SEP] and [CLS] tokens. We also avoided the random replacement of tokens to achieve better results. We used TensorFlow [ 88 ] for training with Keras [ 89 ]. Additionally, we used the Adam optimizer with a learning rate of 1E-4. The batch size was 32 and each model was run for 4 epochs.
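A minimal sketch of this masking scheme is shown below, assuming the standard BERT special-token ids (101, 102, and 103 for [CLS], [SEP], and [MASK]) and the common −100 ignore-index convention for positions excluded from the loss; it is illustrative rather than the exact training code.

```python
import tensorflow as tf

# Standard BERT special-token ids (an assumption; taken from the stock vocab).
CLS_ID, SEP_ID, MASK_ID, PAD_ID = 101, 102, 103, 0

def mask_for_spelling(input_ids: tf.Tensor, mask_rate: float = 0.15):
    """Replace ~15% of ordinary tokens with [MASK] (never [CLS]/[SEP]/padding,
    and no random-token replacement); labels keep the original ids so the
    model learns to reconstruct the unmasked sentence."""
    special = (input_ids == CLS_ID) | (input_ids == SEP_ID) | (input_ids == PAD_ID)
    rand = tf.random.uniform(tf.shape(input_ids))
    to_mask = (rand < mask_rate) & ~special
    masked_inputs = tf.where(to_mask, MASK_ID, input_ids)
    labels = tf.where(to_mask, input_ids, -100)  # -100: ignored by the loss
    return masked_inputs, labels

ids = tf.constant([[101, 7592, 2088, 2003, 102]])
print(mask_for_spelling(ids))
```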

PERTO algorithm

We have designed an algorithm called PERTO, which stands for Persian Orthography Matching. This algorithm ranks the most likely candidate words derived from the output of a pre-trained model, based on shape similarity. In this algorithm, every character in the Persian script is given a distinct code. Characters that share similar forms or glyphs are classified under the same code, enabling words with similar shape characters to be identified, even if there are slight spelling variations. Our pioneering hybrid model classifies characters with the same shapes into identical groups, as depicted in Table  3 .

In order to identify shape similarity in Persian, a PERTO code is generated for the incorrectly spelled word. This code is subsequently matched with the PERTO codes of all potential words generated via edit distance. Our model distinctively merges PERTO with a contextual score ranking system. PERTO is solely utilized for substitution errors. In cases of insertion or deletion type errors, where the PERTO codes of all potential words do not correspond to the PERTO code of the misspelled word, our model depends entirely on contextual scores derived from the pre-trained model. Pseudocode 1 outlines the implementation details of the PERTO algorithm.

To illustrate the PERTO code generation process, let us consider the word "پرگاز," which translates to "a stomach full of gas" in English. The generation of the PERTO code for this word, as per the method outlined in Pseudocode 1, is as follows:

We begin with the first character on the right side of the word and find its hash code from Table  3 . The code for "پ" is 1, which we store in an empty string.

Moving one unit to the left, we retrieve the hash code for the character "ر," which is 4, and add this digit to the string.

This process continues for each character in the word until no characters are left.

For "گاز," the respective codes are "9," "0" and "4," following the same lookup and concatenation procedure.

In the end, we obtain the PERTO code "14904" for the given word, which has the same length as the original word.

figure a

Pseudocode 1  PERTO code generation algorithm
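A compact Python sketch of the code-generation step follows. The digits for پ, ر, گ, ا, and ز are taken from the worked example above; grouping their visually similar mates (ب/ت/ث with پ, ز/ژ with ر, آ with ا, ک with گ) under the same digit is our assumption, as the full grouping is defined only in Table 3.

```python
# Characters sharing a glyph shape get the same digit (subset of Table 3).
PERTO_CODES = {
    "ب": "1", "پ": "1", "ت": "1", "ث": "1",  # same base shape, dots differ
    "ر": "4", "ز": "4", "ژ": "4",
    "ا": "0", "آ": "0",
    "گ": "9", "ک": "9",
}

def perto_code(word: str) -> str:
    """Concatenate per-character codes; iterating the string follows the
    right-to-left reading order described in the steps above."""
    return "".join(PERTO_CODES.get(ch, "?") for ch in word)

print(perto_code("پرگاز"))  # -> "14904", matching the worked example
```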

In the evaluation section of our research, we examine the impact of the PERTO algorithm on the accuracy of spelling correction in the healthcare sector. Through a comprehensive examination of the outcomes, we aim to measure the effectiveness of this algorithm in enhancing the accuracy of spelling correction for Persian medical text, providing insight into its potential applications and benefits in real-world scenarios.

Error detection module

The error detection module utilizes two separate strategies based on the nature of the error being identified. For non-word errors, a lexical lookup approach is employed, while real-word errors are addressed through contextual analysis. The initial step in error detection, irrespective of the error type, involves boundary detection and token identification. Upon receiving an input sentence S, the model first demarcates the start and end of the sentence with Beginning of Sentence ( \(BoS\) ) and End of Sentence ( \(EoS\) ) markers, respectively, and approximates the word count in the sentence:

It’s crucial to note that the word count corresponds to the maximum number of iterations the model will undertake to identify an error in the sentence.

Non-word error detection

Spell checkers predominantly employ the lexical lookup method to detect spelling errors. This technique involves comparing each word in the input sentence, in real time, with a reference dictionary that is usually built as a hash table. Beginning with the \(BoS\) marker, the model scrutinizes every token in the sentence, in sequence, for its correctness. This process continues until the \(EoS\) marker is reached. However, if a word is identified as misspelled, the error detection cycle halts and the error correction phase commences. The following is an illustration of non-word error detection:

figure b

In the given example, the word intended to be typed was (“مایع” /mɑye / ‘fluid’), but it was mistakenly typed as ‘مایغ’. This error is due to a substitution operation and is a single unit of distance away from the correct word. The model was successful in promptly identifying this error.
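As a minimal illustration of this lookup pass (a Python set is itself a hash table, matching the constant-time membership test described above), consider the sketch below; the four-word lexicon stands in for the full dictionary.

```python
LEXICON = {"مایع", "آزاد", "در", "لگن"}  # stand-in for the full dictionary

def first_nonword(tokens: list[str]) -> int | None:
    """Scan tokens in order; return the index of the first out-of-lexicon
    token, or None if every token is found (no non-word error)."""
    for i, token in enumerate(tokens):
        if token not in LEXICON:
            return i  # detection halts here and correction begins
    return None

print(first_nonword(["مایع", "آزاد", "در", "لگن"]))  # -> None
print(first_nonword(["مایغ", "آزاد", "در", "لگن"]))  # -> 0 ('مایغ' flagged)
```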

Real-word error detection

In this study, we employ contextual analysis for the detection of real-word errors. Traditional statistical models relied on n-gram language models, assessing a word's fit in context from the frequency with which it appears alongside its n preceding terms. Contemporary approaches, by contrast, use neural embeddings to evaluate the semantic fit of words within a given sentence. In our proposed methodology, we utilize the mask feature and leverage contextual scores derived from the fine-tuned bidirectional language model to detect and correct word errors. The process of real-word error detection is as follows:

The model begins with the BoS marker and attempts to encode each word as a masked word, starting with the first word.

A list of potential replacements for the masked word is derived from the output of the pre-trained model.

Based on the candidate generation scenario, replacement candidates are generated within edit-distances of 1 and 2 from the masked word.

The list of candidates, along with the original token, is cross-verified against the pre-trained model’s output for the masked token.

If a candidate demonstrates a probability value that surpasses that of the masked word, the initial word is considered erroneous, thus bringing the procedure to a close.

However, if no error is detected, the model shifts one unit to the left, and the same steps are reiterated for all words within the sentence until the EoS marker is encountered.

Therefore, the moment an error is identified, the correction process is initiated immediately; the model then advances to the next sentence. Pseudocode 2 details the real-word error detection process.

figure c

Pseudocode 2  Real-word error detection algorithm
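The detection loop can be sketched with a fill-mask pipeline as follows. The checkpoint path is a hypothetical placeholder for the fine-tuned model, candidates_fn is the edit-distance generator sketched earlier, and matching pipeline outputs to surface tokens by string comparison is a simplification of the real implementation.

```python
from transformers import pipeline

# Hypothetical path standing in for the fine-tuned clinical checkpoint.
fill_mask = pipeline("fill-mask", model="path/to/clinical-persian-bert")

def detect_real_word_error(tokens: list[str], candidates_fn):
    """Mask each token in turn; flag it if any in-lexicon candidate is more
    probable in context than the original token."""
    for i, token in enumerate(tokens):
        masked = " ".join(tokens[:i] + [fill_mask.tokenizer.mask_token] + tokens[i + 1:])
        preds = fill_mask(masked, top_k=200)
        scores = {p["token_str"].strip(): p["score"] for p in preds}
        original_score = scores.get(token, 0.0)
        better = [c for c in candidates_fn(token) if scores.get(c, 0.0) > original_score]
        if better:
            return i, max(better, key=lambda c: scores[c])  # error found
    return None  # sentence judged error-free

# Example: detect_real_word_error(sentence.split(), candidates)
```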

The following is an illustration of successful real-word error detection:

figure d

In the given example, the term ( “اینترارکتال” /intrarectɑl/ ‘intrarectal’) is identified as a real-word error; the word that the user intended to type was (“اینتراداکتال” /intrɑductɑl/ ‘intraductal’). Initially, the model encodes the masked token and feeds it into the pre-trained model, which generates a list of contextually appropriate tokens. Following this, a roster of potential replacement candidates is created using the Damerau-Levenshtein distance measure; in this instance, the edit distance is 2. The model then compares the context similarity score of each replacement candidate with the output list derived from the pre-trained model. Table  4 shows the context similarity scores of the top two replacement candidates.

Error correction module

The error correction phase is initiated when an error is identified in the input. In this stage, we devise a ranking algorithm that primarily relies on the contextual scores obtained from the fine-tuned pre-trained model and the corresponding PERTO codes between potential candidates and the errors.

Non-word error correction process

In the non-word error correction process, the following steps are undertaken:

The model initially employs the Damerau-Levenshtein edit distance measure to generate a set of replacement candidates within 1 or 2 edits.

The misspelled word is subsequently encoded as a “mask” and input into the fine-tuned model.

The model extracts all probable words from the output and matches them against the candidate list.

The model then retains a certain number of candidates with the highest contextual scores. Based on our observations, the optimal number is 10.

The method proceeds to compare the PERTO similarity between the erroneous word and the remaining replacement candidates. If the error and candidate share the same code, that candidate is considered the most suitable word. However, if two or more probable candidates carry the same PERTO code as the erroneous word, then the candidate with the highest contextual score is selected as the replacement for the error.

Pseudocode 3 details the non-word error correction mechanism.

figure e

Pseudocode 3  Non-word error correction algorithm
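Steps 4 and 5 of this procedure (shared with the real-word module below) can be sketched as follows; the PERTO digit assignments here are an illustrative subset, and the contextual scores are invented for the example.

```python
# Toy subset of PERTO groups; ع/غ share a glyph shape, so they share a digit.
PERTO = {"م": "8", "ا": "0", "ی": "2", "ع": "7", "غ": "7", "ل": "5"}

def perto_code(word: str) -> str:
    return "".join(PERTO.get(ch, "?") for ch in word)

def pick_correction(error: str, scored: dict[str, float], keep: int = 10) -> str:
    """Keep the top-`keep` candidates by contextual score, then prefer those
    whose PERTO code matches the error's; ties fall back to contextual score.
    Insertion/deletion candidates differ in length, so their codes never
    match and the choice rests on context alone."""
    top = sorted(scored, key=scored.get, reverse=True)[:keep]
    matches = [c for c in top if perto_code(c) == perto_code(error)]
    return max(matches or top, key=scored.get)

# 'مایل' scores higher in context, but 'مایع' shares the error's PERTO code.
print(pick_correction("مایغ", {"مایل": 0.62, "مایع": 0.55}))  # -> 'مایع'
```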

Real-word error correction process

In the scenario of real-word error correction, the process is as follows:

The contextual scores of potential candidates are retrieved from the fine-tuned model.

The model retains a certain number of candidates with the highest contextual score. Based on our observations, the optimal number is 10.

The method then compares the PERTO similarity between the erroneous word and the replacement candidates. If the error and the candidate share the same code, that candidate is deemed the most suitable word.

However, if two or more probable candidates carry the same PERTO code as the erroneous word, then the candidate with the highest contextual score is selected as the replacement for the error.

Pseudocode 4 details the real-word error correction mechanism.

figure f

Pseudocode 4  Real-word error correction algorithm

Evaluation and results

In this section, we first conduct an analysis of the test data. Following this, we evaluate our method's performance and compare it with various baseline models in the task of spelling correction. This comparison will offer valuable insights into the efficacy and precision of our approach in identifying and rectifying spelling errors.

Test dataset

Our test datasets consist of 188,963 reserved sentences derived from the Persian clinical corpus. Upon scrutinizing the errors present in the test dataset, we found that 1.20% of sentences exhibited instances of non-word errors, which equates to 120 errors in every 10,000 sentences. In addition, 0.29% of sentences contained a real-word error, corresponding to 29 errors in every 10,000 sentences. We examined all the erroneous words to categorize them into one of the predefined classes of errors, such as substitution, transposition, insertion, and deletion. The frequency of these errors, based on the error type, is illustrated in Table  5 . When addressing both real-word and non-word errors, substitution errors are more prevalent than other types of errors. Furthermore, insertion errors are quite common when dealing with both classes of error, while deletion and transposition errors are the least common.

We also analyzed the test dataset for the number of edit distances required for spell correction, the results of which are presented in Table  6 . In dealing with both real-word and non-word errors, 86.1% of misspellings required an edit distance of 1 to correct the incorrect word. 13.7% of errors were rectified with an edit distance of 2, and a mere 2.1% of errors fell within an edit-distance of 3 or more. Due to the combinatorial explosion when generating and examining candidates within distance 3, these classes of error were excluded from the dataset.

Upon conducting a more thorough analysis of the data, we found that 0.8% of sentences contained more than one error. As our method is designed to handle only one-error-per-sentence, we removed these sentences from the test dataset.

Evaluation metrics

The principal metrics for evaluating the effectiveness of models on non-word and real-word error identification and rectification are precision (P), recall (R), and the F-measure (F1-Score). Precision quantifies the model's exactness, whereas recall evaluates its comprehensiveness or sensitivity. The F1-Score, the weighted harmonic mean of these two metrics, combines them into a single measure in which precision and recall are given equal weight. Equation  4 describes the F1-Score evaluation measure.
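In standard form, with TP, FP, and FN denoting true positives, false positives, and false negatives, these measures are:

$$P = \frac{TP}{TP + FP},\qquad R = \frac{TP}{TP + FN},\qquad F_1 = \frac{2\,P\,R}{P + R}$$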

Baseline models

In our research, we implemented two baseline models for non-word correction in Persian clinical text to ensure a comprehensive comparison. These models include the four-gram model introduced by [ 57 ], and a Persian Continuous Bag-of-Words (CBOW) model [ 90 ]. Both models were developed using Python and trained on the same dataset as the pre-trained model. Our aim is to understand the strengths and weaknesses of these models, and leverage this understanding to enhance error correction in Persian language processing. Unfortunately, for real-word error correction in the Persian medical domain, no prior work has been introduced. Therefore, a meaningful comparison is not achievable at this time. This highlights the novelty and importance of our research in this specific area.

Yazdani et al.

The statistical methodology pioneered by Yazdani and colleagues stands out as a promising approach for rectifying non-word errors. It is crafted to address typographical inaccuracies prevalent in Persian healthcare text, thereby enhancing the quality and reliability of the information [ 57 ]. This method leverages a weighted bi-directional four-gram language model to pinpoint the most appropriate substitution for a given error. It incorporates a quadripartite equation that assigns priority to n-grams based on their sequence, thereby enhancing the precision of error correction.

The CBOW model operates by comprehending the semantics of words through the analysis of their surrounding context, and then uses this information to predict suitable words for the given context [ 90 ]. The architecture of the CBOW model is designed to identify the target word (the center word) based on the context words provided, and it has been trained specifically for the task of non-word error rectification. It employs two matrices to calculate the hidden layer (H): the input matrix (IM) and the output matrix (OM). The CBOW model was trained using a corpus of 1.4 million documents drawn from the same data used for the pre-trained model, which facilitated the generation of the input and output matrices. The training parameters incorporated a context window size of 10 and a dimension size of 300.

Non-word error correction evaluation

In the initial phase of assessment, we compare the effectiveness of our suggested methodology with that of the previously mentioned baseline models for non-word error rectification. It is worth highlighting that all models employ a dictionary look-up method for identifying typos, resulting in an F1-score of 100% for typo detection. Table 7 presents the results of the non-word error correction task, providing a detailed comparison of the effectiveness of our approach and the baseline models.

Table 7 provides a detailed comparison of the performance of various models on the non-word error correction task. It compares two configurations of our proposed approach with statistical baselines and the CBOW model. To gauge their effectiveness in practical scenarios, all models were subjected to an extensive array of test instances. The results clearly indicate that both configurations of our approach outperform the other models, demonstrating superior performance. The model achieves its best performance, with an F1-Score of 90.0%, when the PERTO algorithm is employed. The combination of contextual similarity with the PERTO algorithm proves to be the most robust scheme, offering a 1.1% increase in correcting non-word errors compared to using only contextual scores.

The authors of [ 57 ] reported achieving an F1-Score of 90.2% for non-word error correction. However, our attempts to reproduce this result in our evaluations were unsuccessful.

In fact, the approach by Yazdani et al. shows the lowest performance, with an F1-Score of 74.6%; the Contextual Scores + PERTO scheme outperforms it by 15.4 percentage points, further demonstrating the robustness of our method. Within the proposed approach, the scheme that combines contextual scores and PERTO is significantly superior to the one using contextual scores alone, and the most effective results are achieved when the pre-trained model is used in conjunction with the PERTO orthographic similarity algorithm. Our observations confirm that the PERTO algorithm significantly enhances results: substitution errors, which are predominantly either visually or phonetically similar, account for 49.1% of all non-word errors in the test corpus, compared with insertion, deletion, and transposition errors. This underscores the effectiveness of our approach in handling substitution errors.

Real-word error detection and correction evaluations

We performed a comprehensive evaluation of our proposed model for detecting and correcting real-word errors in Persian clinical text. The results of these evaluations are summarized in Table  8 . Our model demonstrated its highest performance in real-word error detection, achieving an F1-Score of 90.6%.

We further evaluated our model’s ability to correct real-word errors. As depicted in Table  8 , our suggested approach, particularly when enhanced with the PERTO algorithm, exhibits outstanding performance in correcting real-word errors across a range of edit distances. The model reached its highest F1-Score of 91.5% when the Persian orthographic similarity algorithm was employed, indicating an enhancement of approximately 1.5 percentage points in the correction F1-Score. It is noteworthy that PERTO significantly enhances the results, as substitution errors constitute 47.8% of all real-word errors in the test corpus, compared with insertion, deletion, and transposition errors; furthermore, a significant portion of these substitution errors bear a visual resemblance to their intended forms.

We also conducted a comprehensive analysis of the errors made by our model. We discovered that in a few cases, real-word errors were missed when the erroneous word had a strong semantic connection to the context words. For instance, in the original word sequence “روده اطراف تستیس راست رویت شد” (The presence of the intestine was observed around the right testis.), the medical typist mistakenly replaced the intended word (“روده” / rʊdeh/ ‘intestine’) with the erroneous word (“توده” /tʊdeh/ ‘mass’), which is within an edit distance of 1. This resulted in the word sequence “توده اطراف تستیس راست رویت شد” (A mass was observed surrounding the right testicle), which had a higher context similarity score than the original word sequence. Consequently, this word sequence was overlooked by the model.

While this issue has not been highlighted in previous research on Persian spelling correction, we believe it poses a significant challenge in addressing real-word errors in Persian clinical texts. To prevent such errors from being overlooked, we could present a list of the most probable candidates along with their context scores to a human expert, allowing them to select the most appropriate replacement. This emphasizes that, despite the advancements in state-of-the-art models, human expertise remains indispensable in certain situations.

In summary, the results indicate that our proposed method exhibits robustness and precision in detecting and rectifying context-sensitive errors in Persian clinical text, thereby affirming its potential for practical application in the field.

Discussion

Typographical errors, a frequent occurrence in radiology reports often attributed to incessant interruptions and a dynamic work environment, can endanger patient health, introduce ambiguity, and undermine the reputation of radiologists [ 91 ]. The primary goal of our research was to develop a novel technique for pinpointing and rectifying spelling inaccuracies in Persian clinical text. The elaborate morphology and syntax of the Persian language, together with the pivotal role of meticulous documentation in fostering effective patient care, facilitating research, and safeguarding patient safety, underscore the importance of this undertaking. Within the Imaging Department at Imam Khomeini Hospital, the formulation of radiology reports is an intricate multi-step process that averages around 30 min in duration.

This process includes dictation by radiologists, transcription by medical typists, and a review and editing process before the final report is stored in the HIS. However, this process includes non-value-added activities, known as ‘Muda’, particularly the time spent between transcription and final confirmation [ 92 ]. Our newly developed software addresses this inefficiency by quickly correcting misspelled words during transcription, reducing the time between initial writing and final confirmation, and thereby decreasing ‘Muda’.

Our approach leverages a pre-trained language representation model, fine-tuned specifically for the task of spelling correction in the clinical domain. This model is complemented by an innovative orthographic similarity matching algorithm, PERTO, which uses visual similarity of characters for ranking correction candidates. This unique combination of techniques distinguishes our approach from existing methods, enabling our model to effectively address both non-word and real-word errors. The evaluation of our approach demonstrated its robustness and precision in detecting and rectifying word errors in Persian clinical text. In terms of non-word error correction, our model achieved an F1-Score of 90.0% when the PERTO algorithm was employed. This represents a 1.1% increase in correcting non-word errors compared to using only contextual scores. For real-word error detection, our model demonstrated its highest performance, achieving an F1-Score of 90.6%. Furthermore, the model reached its highest F1-Score of 91.5% for real-word error correction when the PERTO algorithm was employed, indicating an approximate enhancement of 1.5% in the correction F1-Score.

Despite these promising results, our model has certain limitations. For instance, in a few cases, real-word errors were missed when the erroneous word had a strong semantic connection to the context words. Additionally, while our model is effective in handling non-word and real-word errors, it is not equipped to deal with grammatical errors. Moreover, our model was set up to handle one error per sentence and cannot handle multiple errors; there were a few cases where a sentence included more than one error.

Building upon our current achievements, the integration of emerging technologies such as BCIs, eye-tracking, VR/AR, and EEG offers a promising frontier for further enhancing our spelling correction system. These technologies present unique opportunities to address some of the inherent limitations identified in our study. For example, BCIs could offer intuitive, direct error correction interfaces, while eye-tracking might refine error detection based on user interaction patterns. VR/AR could provide immersive training environments, improving proficiency with correction tools, and EEG monitoring could lead to spelling correction interfaces that adapt to user stress levels and cognitive states, ultimately making the correction process less taxing and more efficient.

While prevailing spelling correction mechanisms for the Persian language cater to a broad spectrum of text and are not tailored to the medical sphere, our system is specifically architected to autonomously pinpoint and amend misspellings prevalent in Persian radiology and ultrasound reports. The seamless integration of automatic spell-checking, notably in facets critical to patient safety such as allergy entries, medication details, diagnoses, and problem listings, can substantially bolster the quality and exactness of electronic medical records. Our system can be integrated as an auxiliary program on platforms like Microsoft Office Word and web browsers, or employed as an API in the HIS, extending the potential applications of our model beyond the clinical domain.

In summary, the results of this study affirm the potential of our proposed method in transforming Persian clinical text processing. By effectively addressing the unique challenges posed by the Persian language and integrating cutting-edge technologies, our approach paves the way for more accurate and efficient clinical documentation, contributing to improved patient care and safety.

Conclusions

This study presents a novel method for detecting and correcting spelling errors in Persian clinical texts, leveraging a pre-trained model fine-tuned for this specific domain. Our approach has notably outperformed existing models, achieving F1-Scores of over 90% in both real-word and non-word error correction. This advancement underscores the method's robustness and its wide-ranging applicability across error types, from substitution and insertion to deletion and transposition. By integrating our orthographic similarity algorithm, PERTO, with contextual insights, we have significantly enhanced the correction success rate, marking a substantial improvement in spelling error correction for Persian clinical texts.

The potential of our methodologies extends beyond medical documentation, offering valuable applications in engineering sciences. The NLP and machine learning techniques employed here could revolutionize error detection and correction in engineering documents and software code, improving review processes, technical documentation accuracy, and software development efficiency. Furthermore, our findings could inform the creation of intelligent diagnostic systems for predictive maintenance and quality control, leveraging our error correction mechanisms for enhanced precision and reliability.

Looking ahead, we aim to refine our model further to tackle multiple errors within a sentence and address grammatical inaccuracies, broadening our method's comprehensiveness for the Persian medical domain. Additionally, we plan to explore the integration of emerging technologies like BCI, eye-tracking, VR/AR, and EEG, aiming to create more intuitive correction interfaces and immersive training environments. These efforts will not only advance spelling correction tools technically but also amplify their practical impact in medical documentation, contributing to improved patient care and safety.

Availability of data and materials

The data that support the findings of this study are held by Imam Khomeini Hospital. They are not publicly accessible due to privacy restrictions. However, they may be available from the authors upon reasonable request and with permission from the hospital.

Footnote 1: All pronunciations are provided in the International Phonetic Alphabet (IPA).

Abbreviations

Electronic health record

Optical character recognition

Automatic Speech Recognition

Natural Language Processing

Hospital information system

Brain-Computer Interfaces

Virtual Reality

Augmented Reality

Electroencephalography

Masked Language Model

Wong W, Glance D. Statistical semantic and clinician confidence analysis for correcting abbreviations and spelling errors in clinical progress notes. Artif Intell Med. 2011;53(3):171–80.


Zhou L, et al. Analysis of errors in dictated clinical documents assisted by speech recognition software and professional transcriptionists. JAMA Netw Open. 2018;1(3):e180530–e180530.


Turchin A, et al. Identification of misspelled words without a comprehensive dictionary using prevalence analysis. AMIA Ann Symp Proc. 2007;2007:751–5 American Medical Informatics Association.


Wilcox-O’Hearn A, Hirst G, Budanitsky A. Real-word spelling correction with trigrams: A reconsideration of the Mays, Damerau, and Mercer model. In: International conference on intelligent text processing and computational linguistics. Berlin, Heidelberg: Springer Berlin Heidelberg; 2008. p. 605–16.

Hirst G, Budanitsky A. Correcting real-word spelling errors by restoring lexical cohesion. Nat Lang Eng. 2005;11(1):87–111.


Bassil Y, Alwani M. OCR context-sensitive error correction based on Google Web 1T 5-gram data set. Am J Sci Res. 2012;50.

Deng L, Huang X. Challenges in adopting speech recognition. Commun ACM. 2004;47(1):69–75.

Hartley RT, Crumpton K. Quality of OCR for degraded text images. In: Proceedings of the fourth ACM conference on Digital libraries. 1999. p. 228–9.

Jurafsky D, James H, Martin J. Speech and Language Processing: An Introduction to Natural Language Processing. Computational Linguistics, and Speech Recognition. 2nd ed. New Jersey: Prentice-Hall; 2008.

Atkinson K. GNU Aspell 0.60.4. 2006. Retrieved from http://aspell.net

Damerau FJ. A technique for computer detection and correction of spelling errors. Commun ACM. 1964;7(3):171–6.

Idzelis M, Galbraith B. Jazzy: The Java open source spell checker. 2005. Retrieved 10 Oct 2019, from http://jazzy.sourceforge.net

Levenshtein VI. Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl. 1966;10(8):707–10.

Dashti SMS, et al. Toward a thesis in automatic context-sensitive spelling correction. Int J Artif Intell Mechatron. 2014;3(1):19–24.

Mays E, Damerau FJ, Mercer RL. Context based spelling correction. Inf Process Manage. 1991;27(5):517–22.

Samanta P, Chaudhuri BB. A simple real-word error detection and correction using local word bigram and trigram. In: Proceedings of the 25th conference on computational linguistics and speech processing (ROCLING 2013). 2013.

Wilcox-O'Hearn LA. Detection is the central problem in real-word spelling correction. 2014. arXiv preprint arXiv:1408.3153.

Dashti SM, Khatibi Bardsiri A, Khatibi Bardsiri V. Correcting real-word spelling errors: A new hybrid approach. Digital Sch Humanit. 2018;33(3):488–99.

Dashti SM. Real-word error correction with trigrams: correcting multiple errors in a sentence. Lang Resour Eval. 2018;52(2):485–502.

Pande H. Effective search space reduction for spell correction using character neural embeddings. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. 2017.

Hu Y, Jing X, Ko Y, Rayz JT. Misspelling Correction with Pre-trained Contextual Language Model. 2020 IEEE 19th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC). IEEE: Beijing; 2020. p. 144–49. https://doi.org/10.1109/ICCICC50026.2020.9450253 .

Lee J-H, Kim M, Kwon H-C. Deep learning-based context-sensitive spelling typing error correction. IEEE Access. 2020;8:152565–78.

Sun R, Wu X, Wu Y. An Error-Guided Correction Model for Chinese Spelling Error Correction. In: Findings of the Association for Computational Linguistics: EMNLP 2022. 2022. p. 3800–10.

Jayanthi SM, Pruthi D, Neubig G. NeuSpell: A Neural Spelling Correction Toolkit. EMNLP 2020. 2020:158.

Ji T, Yan H, Qiu X. SpellBERT: A lightweight pretrained model for Chinese spelling check. In: Proceedings of the 2021 conference on empirical methods in natural language processing. 2021.

Liu S, et al. PLOME: Pre-training with misspelled knowledge for Chinese spelling correction. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 2021.

Zhang R, et al. Correcting Chinese spelling errors with phonetic pre-training. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 2021.

Wang X, et al. Towards contextual spelling correction for customization of end-to-end speech recognition systems. IEEE/ACM Trans Audio, Speech Lang Proc. 2022;30:3089–97.

Zhu C, et al. MDCSpell: A multi-task detector-corrector framework for Chinese spelling correction. In: Findings of the Association for Computational Linguistics: ACL 2022. 2022.

Liu S, et al. CRASpell: A contextual typo robust approach to improve Chinese spelling correction. In: Findings of the Association for Computational Linguistics: ACL 2022. 2022.

Salhab M, Abu-Khzam F. AraSpell: A Deep Learning Approach for Arabic Spelling Correction. 2023.

Dalianis H, Dalianis H. Characteristics of patient records and clinical corpora. In: Clinical Text Mining: Secondary Use of Electronic Patient Records. 2018. p. 21–34.

Chapter   Google Scholar  

Hussain F, Qamar U. Identification and correction of misspelled drugs’ names in electronic medical records (EMR). In: International Conference on Enterprise Information Systems, vol. 3. SCITEPRESS; 2016. p. 333–8.

Kilicoglu H, et al. An ensemble method for spelling correction in consumer health questions. AMIA Annu Symp Proc. 2015;2015:727 American Medical Informatics Association.

PubMed   PubMed Central   Google Scholar  

Zhou X, et al. Context-sensitive spelling correction of consumer-generated content on health care. JMIR Med Inform. 2015;3(3): e4211.

Ruch P, Baud R, Geissbühler A. Using lexical disambiguation and named-entity recognition to improve spelling correction in the electronic patient record. Artif Intell Med. 2003;29(1–2):169–84.

Siklósi B, Novák A, Prószéky G. Context-aware correction of spelling errors in Hungarian medical documents. In: Statistical Language and Speech Processing: First International Conference, SLSP 2013. Proceedings 1 2013. Tarragona: Springer Berlin Heidelberg; 2013. p. 248–59.

Grigonyte G, et al. Improving readability of Swedish electronic health records through lexical simplification: First results. In: European Chapter of ACL (EACL), 26–30 April, 2014. Gothenburg: Association for Computational Linguistics; 2014.

Tolentino HD, et al. A UMLS-based spell checker for natural language processing in vaccine safety. BMC Med Inform Decis Mak. 2007;7:1–13.

Doan S, et al. Integrating existing natural language processing tools for medication extraction from discharge summaries. J Am Med Inform Assoc. 2010;17(5):528–31.

Lai KH, et al. Automated misspelling detection and correction in clinical free-text records. J Biomed Inform. 2015;55:188–95.

Fivez P, Šuster S, Daelemans W. Unsupervised context-sensitive spelling correction of English and Dutch clinical free-text with word and character n-gram embeddings. 2017. arXiv preprint arXiv:1710.07045.

Pérez A, et al. Inferred joint multigram models for medical term normalization according to ICD. Int J Med Informatics. 2018;110:111–7.

Khan MF, et al. Augmented reality based spelling assistance to dysgraphia students. J Basic Appl Sci. 2017;13:500–7.

Li Y, et al. Exploring text revision with backspace and caret in virtual reality. In: Proceedings of the 2021 CHI conference on human factors in computing systems. 2021.

Lim J-H, et al. Development of a hybrid mental spelling system combining SSVEP-based brain–computer interface and webcam-based eye tracking. Biomed Signal Process Control. 2015;21:99–104.

Mora-Cortes A, et al. Language model applications to spelling with brain-computer interfaces. Sensors. 2014;14(4):5967–93.

D’hondt E, Grouin C, Grau B. Low-resource OCR error detection and correction in French Clinical Texts. In: Proceedings of the seventh international workshop on health text mining and information analysis. 2016.

Tran K, Nguyen A, Vo C, Nguyen P. Vietnamese Electronic Medical Record Management with Text Preprocessing for Spelling Errors. 2022 9th NAFOSTED Conference on Information and Computer Science (NICS), Ho Chi Minh City: IEEE; 2022. p. 223–9. https://doi.org/10.1109/NICS56915.2022.10013386 .

Dastgheib MB, Fakhrahmad SM, Jahromi MZ. Perspell: a new Persian semantic-based spelling correction system. Digit Sch Humanit. 2017;32(3):543–53.

Ghayoomi M, Assi SM. Word prediction in a running text: A statistical language modeling for the Persian language. In: Proceedings of the Australasian Language Technology Workshop 2005. 2005.

Kashefi O, Sharifi M, Minaie B. A novel string distance metric for ranking Persian respelling suggestions. Nat Lang Eng. 2013;19(2):259–84.

MosaviMiangah T. FarsiSpell: a spell-checking system for Persian using a large monolingual corpus. Literary Linguist Comput. 2014;29(1):56–73.

Naseem T, Hussain S. A novel approach for ranking spelling error corrections for Urdu. Lang Resour Eval. 2007;41(2):117–28.

Shamsfard M. Challenges and open problems in Persian text processing. Proceedings of LTC. 2011;11:65–9.

Shamsfard M, Jafari HS, Ilbeygi M. STeP-1: A set of fundamental tools for Persian text processing. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10). 2010.

Yazdani A, et al. Automated misspelling detection and correction in Persian clinical text. J Digit Imaging. 2020;33:555–62.

Faili H, Ehsan N, Montazery M, Pilehvar MT. Vafa spell-checker for detecting spelling, grammatical, and real-word errors of Persian language. Digital Scholarsh Humanit. 2016;31(1):95–117.

Ghayoomi M, Momtazi S, Bijankhan M. A Study of Corpus Development for Persian. Int J Asian Lang Process. 2010;20(1):17–34.

Farshbafian A, Asl ES. A metafunctional approach to word order in Persian language. J Lang Linguist Stud. 2021;17(S2):773–93.

Seraji M, Megyesi B, Nivre J. A basic language resource kit for Persian. In: Eight International Conference on Language Resources and Evaluation (LREC 2012), 23–25 May 2012. Istanbul: European Language Resources Association; 2012.

Miangah TM, Vulanović R. The Ambiguity of the Relations between Graphemes and Phonemes in the Persian Orthographic System. Glottometrics. 2021;50:9–26.

Modarresi Ghavami G. Vowel Harmony and Vowel-to-Vowel Coarticulation in Persian. Language and Linguistics. 2010;6(11):69–86.

Sedighi A. Persian in use: An Elementary Textbook of Language and Culture. 1st ed. Leiden University Press; 2015.  https://www.muse.jhu.edu/book/46336 .

Mozafari J, et al. PerAnSel: a novel deep neural network-based system for Persian question answering. Comput Intell Neurosci. 2022;2022:3661286.

Ghomeshi J. The additive particle in Persian: A case of morphological homophony between syntax and pragmatics. Adv Iran Linguist. 2020;1:57–84.

Bonyani M, Jahangard S, Daneshmand M. Persian handwritten digit, character and word recognition using deep learning. Int J Doc Anal Recognit. 2021;24(1–2):133–43.

Rasooli MS, et al. Automatic standardization of colloquial Persian. 2020. arXiv preprint arXiv:2012.05879.

Farahani M, et al. Parsbert: Transformer-based model for persian language understanding. Neural Process Lett. 2021;53:3831–47.

Dehkhoda AA. Dehkhoda dictionary. Tehran: Tehran University; 1998. p. 1377.

Peterson JL. A note on undetected typing errors. Commun ACM. 1986;29(7):633–7.

Huang Y, Murphey YL, Ge Y. Automotive diagnosis typo correction using domain knowledge and machine learning. 2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM), Singapore: IEEE; 2013. p. 267–74. https://doi.org/10.1109/CIDM.2013.6597246 .

Kukich K. Techniques for automatically correcting words in text. ACM Comput Surv (CSUR). 1992;24(4):377–439.

Dowsett DJ. Radiological sciences dictionary : keywords, names and definitions. 1st ed. Hodder Arnold; 2009. https://doi.org/10.1201/b13300 .

Pennington J, Socher R, Manning CD. Global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014.

Mikolov T, et al. Distributed representations of words and phrases and their compositionality. Adv Neural Inf Proc Syst. 2013;26:3111–9.

Mikolov T, Yih WT, Zweig G. Linguistic regularities in continuous space word representations. In: Proceedings of the 2013 conference of the north american chapter of the association for computational linguistics: Human language technologies. 2013.

Goldberg Y. A primer on neural network models for natural language processing. J Artif Intell Res. 2016;57:345–420.

Radford A, et al. Improving language understanding by generative pre-training. 2018.

Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019. p. 4171–86.

Sarzynska-Wawer J, et al. Detecting formal thought disorder by deep contextualized word representations. Psychiatry Res. 2021;304: 114135.

Conneau A, Khandelwal K, Goyal N, Chaudhary V, Wenzek G, Guzmán F, Grave É, Ott M, Zettlemoyer L, Stoyanov V. Unsupervised Cross-lingual Representation Learning at Scale. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. p. 8440–51.

Raffel C, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res. 2020;21(140):1–67.

Yang Z, et al. Xlnet: Generalized autoregressive pretraining for language understanding. Adv Neural Inf Proc Syst. 2019;32:1–11.

Liu Y, et al. Roberta: a robustly optimized bert pretraining approach; 2019. arXiv preprint arXiv:1907.11692.

Wang W, Bao F, Gao G. Learning morpheme representation for mongolian named entity recognition. Neural Process Lett. 2019;50(3):2647–64.

Taghizadeh N, et al. SINA-BERT: a pre-trained language model for analysis of medical texts in Persian. 2021. arXiv preprint arXiv:2104.07613.

Abadi M, et al. Tensorflow: a system for large-scale machine learning. Savannah: Osdi; 2016.

Ketkar N, Ketkar N. Introduction to keras. Deep learning with python: a hands-on introduction. 2017. p. 97–111.

Mikolov T, et al. Efficient estimation of word representations in vector space. 2013. arXiv preprint arXiv:1301.3781.

Minn MJ, Zandieh AR, Filice RW. Improving radiology report quality by rapidly notifying radiologist of report errors. J Digit Imaging. 2015;28:492–8.

Kruskal JB, et al. Quality initiatives: lean approach to improving performance and efficiency in a radiology department. Radiographics. 2012;32(2):573–87.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information

Authors and Affiliations

Department of Computer Engineering, Kerman Branch, Islamic Azad University, Kerman, Iran

Seyed Mohammad Sadegh Dashti

Department of Advanced Research, Bushehr University of Medical Sciences, Bushehr, Iran

Seyedeh Fatemeh Dashti


Contributions

Seyed Mohammad Sadegh Dashti and Seyedeh Fatemeh Dashti conceptualized and designed the study. Seyed Mohammad Sadegh Dashti developed the model and performed the experiments. Seyedeh Fatemeh Dashti collected and analyzed the data. Both authors contributed to writing the manuscript and approved the final version for publication.

Corresponding author

Correspondence to Seyed Mohammad Sadegh Dashti.

Ethics declarations

Ethics approval and consent to participate

The study was conducted in accordance with ethical standards and received approval from the Institutional Review Board of the Islamic Azad University, Kerman Branch (approval ID: IR.IAU.KERMAN.REC.1402.124). As the study did not involve any human trials, the requirement for informed consent was waived by the same Institutional Review Board.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .


About this article

Cite this article

Dashti, S.M.S., Dashti, S.F. Improving the quality of Persian clinical text with a novel spelling correction system. BMC Med Inform Decis Mak 24, 220 (2024). https://doi.org/10.1186/s12911-024-02613-0


Received: 21 October 2023

Accepted: 17 July 2024

Published: 05 August 2024

DOI: https://doi.org/10.1186/s12911-024-02613-0


Keywords

  • Real-word error
  • Non-word error
  • Spelling correction
  • Contextualized embeddings
  • Deep learning
  • Radiology reporting

BMC Medical Informatics and Decision Making

ISSN: 1472-6947
