Arabic Natural Language Processing

Arabic poses many challenges for Natural Language Processing (NLP). It is both morphologically rich and highly ambiguous. In Modern Standard Arabic (MSA), a complete part-of-speech tag set has over 300,000 tags (English has about 50), and MSA words have 12 morphological analyses on average (English words have 1.25 POS tags on average). The high ambiguity is primarily the result of Arabic orthography, which almost always omits the diacritics that mark short vowels and consonantal doubling.

Furthermore, Arabic has complex morpho-syntactic agreement rules and many irregular forms: over half of Arabic plurals are irregular (“broken plurals”). Finally, Arabic has a large number of dialectal variants that differ from MSA as much as the Romance languages differ from Latin. MSA is the official form of Arabic but is no one’s mother tongue. The dialects, the true mother tongues, are primarily spoken, lack written standards, and have very limited resources.
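The orthographic ambiguity mentioned above can be illustrated with a minimal Python sketch. It is not taken from any CAMeL Lab tool: the function name `strip_diacritics` and the four example words are illustrative choices, and the character range U+064B to U+0652 is assumed to cover the diacritics in question (nunation, short vowels, shadda, sukun). The sketch shows four distinct diacritized words collapsing to one written form once the diacritics are dropped.

```python
import re

# Arabic diacritic marks U+064B..U+0652: nunation, short vowels, shadda, sukun.
# (Assumed range for this sketch; other marks exist outside it.)
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(word: str) -> str:
    """Remove diacritics, mimicking ordinary undiacritized Arabic spelling."""
    return DIACRITICS.sub("", word)


# Four fully diacritized words on the root k-t-b, written with escapes
# so that each diacritic is explicit. Illustrative examples only.
FORMS = {
    "\u0643\u064E\u062A\u064E\u0628\u064E": "kataba, 'he wrote'",
    "\u0643\u064F\u062A\u064F\u0628": "kutub, 'books'",
    "\u0643\u064F\u062A\u0650\u0628\u064E": "kutiba, 'it was written'",
    "\u0643\u064E\u062A\u0651\u064E\u0628\u064E": "kattaba, 'he made (someone) write'",
}

# Every entry maps to the same three-letter surface form.
for word, gloss in FORMS.items():
    print(strip_diacritics(word), "<-", gloss)
```

All four entries print the same undiacritized string, كتب, which is why a reader or an NLP system has to recover the intended word from context.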

The CAMeL Lab projects that address these challenges are listed below by subcategory:

Arabic and Arabic Dialect Orthography

Arabic and Arabic Dialect Morphological Analysis and Disambiguation

Arabic Syntactic Analysis

Arabic Dialect Corpora

Arabic Readability and Text Simplification

Machine Translation

Most of the machine translation research in CAMeL focuses on Arabic as the source or target language. We list below three ongoing efforts.

Arabic Sentiment Analysis

Information Retrieval

Capstone Projects

Other