Survey of Arabic Checker Techniques

Ahmed Abdalrhman Saty, Karim Bouzoubaa, Aouragh Si Lhoussain

Abstract


Spell checking grows in importance as technology spreads, Internet use expands, local dialects appear more often in writing, and many writers remain unaware of linguistic rules. The need is especially acute for Arabic, whose many complexities and specificities set it apart from other languages. This paper explains these specificities, presents existing work grouped by the category of technique used, and examines the techniques themselves. It also gives directions for future work.

Keywords


spell checking, rule-based, morphology, n-gram, radix-search tree, Levenshtein distance, Jaro-Winkler distance


References



G. Hicham, Y. Abdallah, and B. Mostapha, (2012), “Introduction of the weight edition errors in the Levenshtein distance,” International Journal of Advanced Research in Artificial Intelligence, vol. 1, no. 5, pp. 30–32.

B. Hamza, Y. Abdellah, G. Hicham, and B. Mostafa, (2014), “For an Independent Spell-Checking System from the Arabic Language Vocabulary,” International Journal of Advanced Computer Science and Applications, vol. 5, no. 1, pp. 113–116.

H. Gueddah, A. Yousfi, and M. Belkasmi, (2016), “The filtered combination of the weighted edit distance and the Jaro-Winkler distance to improve spell checking Arabic texts,” IEEE/ACS International Conference on Computer Systems and Applications, AICCSA, vol. 2016-July, pp. 1–6.

H. Muaidi and R. Al-Tarawneh, (2012), “Towards Arabic Spell-Checker Based on N-Grams Scores,” International Journal of Computer Applications, vol. 53, no. 3, pp. 975–8887.

M. M. Al-Jefri and S. A. Mahmoud, (2015), “Context-Sensitive Arabic Spell Checker Using Context Words and N-Gram Language Models,” in Proceedings of Taibah University International Conference on Advances in Information Technology for the Holy Quran and Its Sciences, NOORIC, pp. 258–263.

K. Bacha and M. Zrigui, (2016), “Contribution to the Achievement of a Spell checker for Arabic,” Research in Computing Science, vol. 117, no. January, pp. 161–172.

Y. Hassan, M. Aly, and A. Atiya, (2014), “Arabic Spelling Correction using Supervised Learning,” the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), pp. 121–126.

A. O. Al-Shbail and M. A. B. Diab, (2018), “Arabic Writing, Spelling Errors and Methods of Treatment,” Journal of Language Teaching and Research, vol. 9, no. 5, p. 1026.

SUST Journal of Engineering and Computer Sciences (JECS), Vol. 21, No. 1, 2020

A. Martinench, (2014), “The Formation of Nominal Derivatives in the Arabic Language With a View to Computational Linguistics,” University of Salford.

T. Zerrouki and A. Balla, (2009), “Implementation of infixes and circumfixes in the spell checkers,” in the Second International Conference on Arabic Language Resources and Tools, pp. 61–65.

B. Haddad and M. Yaseen, (2007), “Detection and Correction of Non-Words in Arabic: A Hybrid Approach,” International Journal of Computer Processing Of Languages, vol. 20, no. 04, p. 237.

N. Gupta and P. Mathur, (2012), “Spell Checking Techniques in NLP: A Survey,” International Journal of Advanced Research in Computer Science and Software Engineering, vol. 2, no. 12, pp. 2277–128.

A. Protopapas, A. Fakou, S. Drakopoulou, C. Skaloumbakas, and A. Mouzaki, (2013), “What do spelling errors tell us? Classification and analysis of errors made by Greek schoolchildren with and without dyslexia,” Reading and Writing, vol. 26, no. 5, pp. 615–646.

V. V. Bhaire, A. A. Jadhav, and P. A. Pashte, (2015), “Spell checker,” International Journal of Scientific and Research Publications, vol. 5, no. 4, pp. 5–7.

A. A. M. Mahdi, (2012), “Spell Checking and Correction for Arabic Text Recognition,” King Fahd University of Petroleum & Minerals.

A. M. Azmi, M. N. Almutery, and H. A. Aboalsamh, (2019), “Real-Word Errors in Arabic Texts: A Better Algorithm for Detection and Correction,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 27, no. 8, pp. 1308–1320.

A. S. Lhoussain, G. Hicham, and Y. Abdellah, (2015), “Adaptating the Levenshtein distance to contextual spelling correction,” International Journal of Computer Science and Applications, vol. 12, no. 1, pp. 127–133.

B. Mohit, A. Rozovskaya, N. Habash, W. Zaghouani, and O. Obeid, (2015), “The First QALB Shared Task on Automatic Text Correction for Arabic,” the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP), pp. 39–47.

T. Zerrouki, K. Alhawaity, and A. Balla, (2014), “Autocorrection Of Arabic Common Errors For Large Text Corpus QALB-2014 Shared Task,” EMNLP 2014 Workshop on Arabic Natural Language Processing, no. 2005, pp. 127–131.

N. AlShenaifi, R. AlNefie, M. Al-Yahya, and H. Al-Khalifa, (2015), “Arib @ QALB-2015 Shared Task: A Hybrid Cascade Model for Arabic Spelling Error Detection and Correction,” Proceedings of the Second Workshop on Arabic Natural Language Processing, pp. 127–132.

M. Attia, M. Al-badrashiny, and M. Diab, (2015), “Priming Spelling Candidates with Probability,” in Proceedings of the Second Workshop on Arabic Natural Language Processing, vol. 10.18, no. January.

M. I. Alkanhal, M. A. Al-Badrashiny, M. M. Alghamdi, and A. O. Al-Qabbany, (2012), “Automatic stochastic Arabic spelling correction with emphasis on space insertions and deletions,” IEEE Transactions on Audio, Speech and Language Processing, vol. 20, no. 7, pp. 2111–2122.

K. Shaalan, M. Attia, P. Pecina, Y. Samih, and J. van Genabith, (2012), “Arabic Word Generation and Modelling for Spell Checking,” the Eighth International Conference on Language Resources and Evaluation, pp. 719–725.

H. Gueddah and A. Yousfi, (2013), “The impact of Arabic inter-character proximity and similarity on spell-checking,” 8th International Conference on Intelligent Systems: Theories and Applications, Rabat, Morocco.

P. Christen, (2006), “A Comparison of Personal Name Matching: Techniques and Practical Issues,” Sixth IEEE International Conference on Data Mining - Workshops (ICDMW’06), pp. 290–294.

R. Al-Tarawneh, H. S. A. Hamatta, H. Muiadi, P. Abdullah, and B. Ghazi, (2014), “Novel Approach for Arabic Spell-Checker: Based on Radix Search Tree,” International Journal of Computer Applications, vol. 95, no. 7, pp. 975–8887.

H. F. Alshahad, (2018), “Arabic Spelling Checker Algorithm for Speech Recognition,” International Journal of Computer Science and Information Security (IJCSIS), vol. 15, no. 12, pp. 228–235.

M. Attia, P. Pecina, Y. Samih, K. Shaalan, and J. Van Genabith, (2012), “Improved Spelling Error Detection and Correction for Arabic,” Natural Language Engineering, vol. 22, no. 5, pp. 103–112.

N. Mohammed and Y. Abdellah, (2018), “The vocabulary and the morphology in spell checker,” in The First International Conference on Intelligent Computing in Data Sciences, vol. 127, pp. 76–81.

A. El Oualkadi, F. Choubani, and A. El Moussati, (2016), “A lightweight system for correction of Arabic derived words,” in Mediterranean Conference on Information & Communication Technologies, vol. 380, pp. 131–138.

H. Bouamor, H. Sajjad, N. Durrani, and K. Oflazer, (2015), “Shared Task: Combining Character level MT and Error-tolerant Finite-State Recognition for Arabic Spelling Correction,” the Second Workshop on Arabic Natural Language Processing, pp. 144–149.

M. Attia, P. Pecina, and A. Toral, (2011), “An open-source finite state morphological transducer for modern standard Arabic,” in Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing, pp. 125–133.

K. Shaalan, A. Allam, and A. Gomah, (2003), “Towards automatic spell checking for Arabic,” in Proceedings of the Fourth Conference on Language Engineering, Egyptian Society of Language Engineering (ELSE), Egypt, no. May, pp. 240–247.

S. C. Wayland et al., (2010), “Finding Entries in an On-line Arabic Dictionary,” Human-Computer Interaction Lab 27th Annual Symposium, pp. 1–2.

K. Shaalan, R. Aref, and A. Fahmy, (2010), “An approach for analyzing and correcting spelling errors for non-native Arabic learners,” The 7th International Conference of Informatics and Systems.

H. M. Noaman, S. S. Sarhan, and M. A. A. Rashwan, (2016), “Automatic Arabic Spelling Errors Detection and Correction Based on Confusion Matrix-Noisy Channel Hybrid System,” Journal of Theoretical and Applied Information Technology, vol. 40, no. 2, pp. 54–

Miniwatts Marketing Group, Arabic Speaking Internet Users Statistics, Internet World Stats: Usage and Population Statistics, June 30, (2017). Accessed on: Mar 3, 2019. [Online]. Available: https://www.internetworldstats.com/stats19.htm

Miniwatts Marketing Group, Arabic Speaking Internet Users Statistics, Internet World Stats: Usage and Population Statistics, June 30, (2017). Accessed on: Mar 3, 2019.

Satu Limaye, Islam in Asia, Asia-Pacific Center for Security Studies, April 16, (1999).

"Levenshtein distance," (2019). Accessed on: May 30, 2019.

"Jaro–Winkler distance," May 30, (2019). Accessed on: May 30, 2019.

"Soundex," June 14, (2019). Accessed on: June 14, 2019.

C. D. Manning, P. Raghavan, and H. Schütze, (2009), “An Introduction to Information Retrieval,” Cambridge University Press, Cambridge, England.

TABLE 1: ISOLATED-WORD STUDIES OF ARABIC SPELL CHECKING

| Work | Used dataset | Used techniques |
| --- | --- | --- |
| Alshahad, 2018 [27] | | Similarity techniques |
| Nejja and Yousfi, 2018 [29] | Sub-dictionaries | Morphology and similarity techniques |
| Hicham Gueddah et al., 2016 [3] | Learning corpus | Similarity techniques |
| Mohammed Attia et al., 2012 [28] | Arabic Gigaword Corpus, and news articles crawled from the Al-Jazeera website | Hybrid techniques |
| Noaman et al., 2016 [36] | QALB corpus, and confusion matrix | Hybrid techniques |
| Nejja Mohammed and Yousfi Abdellah, 2016 [30] | A corpus (containing 10,000 words) constituted of surface patterns and roots characterized | Morphology and similarity techniques |
| Mohammed Attia et al., 2012 [28] | A dictionary of 9.3 million fully inflected Arabic words | Similarity and rule-based techniques |
| Bouamor et al., 2015 [31] | QALB corpus, AraComLex, and MADAMIRA | Hybrid techniques |
| Mohammed Attia et al., 2015 [21] | QALB corpus, Conditional Random Field (CRF), MADAMIRA morphological, and AraComLex Extended | Rule-based techniques |
| AlShenaifi et al., 2015 [20] | QALB corpus, KSU corpus, Arabic Corpora (OSAC), Al-Sulaiti Corpus, KACST Arabic Corpus, and MADAMIRA | Rule-based and similarity techniques |
| Mohammed Attia et al., 2015 [21] | Arabic Gigaword Corpus, and a corpus crawled from Al-Jazeera | Rule-based techniques |
| Aouragh Si Lhoussain et al., 2015 [17] | | Similarity techniques |
| Youssef Hassan et al., 2014 [7] | QALB corpus, AraComLex2, MADAMIRA3, and confusion matrix | Rule-based, morphology, and similarity techniques |
| Al-Tarawneh et al., 2014 [26] | Muaidi Corpus | Similarity techniques |
| Zerrouki et al., 2014 [19] | QALB-2014 corpus and replacement list | Rule-based techniques |
| Gueddah Hicham et al., 2013 [1] | Set of Arabic documents typed by four expert users | Similarity techniques |
| Hicham Gueddah and Abdallah Yousfi, 2013 [24] | Typing test of a training corpus | Similarity techniques |
| Muaidi & Al-Tarawneh, 2012 [4] | Muaidi Corpus | Similarity techniques |
| Mohamed Alkanhal et al., 2012 [22] | A standard Arabic text corpus and test data (cover all types of spelling errors) | Rule-based techniques |
| Khaled Shaalan et al., 2012 [23] | | Hybrid techniques |
| Mohammed Attia et al., 2011 [32] | AraComLex, and a corpus of 1,089,111,204 words | Morphology techniques |
| Wayland et al., 2010 [34] | Arabic electronic dictionaries and confusion matrices | Similarity techniques |
| Khaled Shaalan et al., 2010 [35] | | Rule-based and similarity techniques |
| Khaled Shaalan et al., 2003 [33] | | Rule-based techniques |

TABLE 2: CONTEXT-SENSITIVE STUDIES OF ARABIC SPELL CHECKING

| Work | Used dataset | Used techniques |
| --- | --- | --- |
| Azmi et al., 2019 [16] | KSU, ANC-KACST, and JM corpus | Morphology and similarity techniques |
| Majed Al-Jefri and Sabri Mahmoud, 2015 [5] | Corpus from Al-Riyadh newspaper articles on three topics, in addition to confusion sets (OCR misrecognized words) | Similarity and phonetic techniques |
| Mohammed Attia et al., 2015 [21] | Arabic Gigaword Corpus, and a corpus crawled from Al-Jazeera | Rule-based techniques |