Arabic Natural Language Processing

Arabic poses many challenges for Natural Language Processing (NLP). It is both morphologically rich and highly ambiguous. In Modern Standard Arabic (MSA), a complete part-of-speech tag set has over 300,000 tags (whereas English has about 50), and MSA words have 12 morphological analyses on average (English words have 1.25 POS tags on average). The high ambiguity is primarily the result of Arabic orthography, which almost always omits the diacritics used to specify short vowels and consonantal doubling.
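The effect of omitted diacritics can be illustrated with a minimal Python sketch. It is not taken from any CAMeL Lab tool; the helper names (`strip_diacritics`, `group_by_undiacritized`) and the choice of example words are assumptions made for illustration. It removes the Arabic diacritics in the Unicode range U+064B to U+0652 and shows that several distinct diacritized words built on the letters k-t-b collapse to the same written form.

```python
import re

# Arabic diacritics (tanwin, short vowels, shadda, sukun): U+064B to U+0652.
DIACRITICS = re.compile("[\u064B-\u0652]")

# Letters and diacritics spelled out by code point to keep the example explicit.
KAF, TEH, BEH = "\u0643", "\u062A", "\u0628"
FATHA, DAMMA, KASRA, SHADDA = "\u064E", "\u064F", "\u0650", "\u0651"

# Illustrative diacritized words, keyed by transliteration and gloss.
DIACRITIZED = {
    "kataba 'he wrote'": KAF + FATHA + TEH + FATHA + BEH + FATHA,
    "kutub 'books'": KAF + DAMMA + TEH + DAMMA + BEH,
    "kutiba 'it was written'": KAF + DAMMA + TEH + KASRA + BEH + FATHA,
    "kattaba 'he made (someone) write'": KAF + FATHA + TEH + SHADDA + FATHA + BEH + FATHA,
}


def strip_diacritics(text: str) -> str:
    """Return text with short-vowel, doubling, and related marks removed."""
    return DIACRITICS.sub("", text)


def group_by_undiacritized(forms: dict[str, str]) -> dict[str, list[str]]:
    """Group glosses by the undiacritized spelling of their word."""
    groups: dict[str, list[str]] = {}
    for gloss, word in forms.items():
        groups.setdefault(strip_diacritics(word), []).append(gloss)
    return groups


if __name__ == "__main__":
    for bare, glosses in group_by_undiacritized(DIACRITIZED).items():
        print(bare, "<-", glosses)
```

All four words share one undiacritized spelling, so a system reading ordinary text has to recover the intended reading from context. A full analyzer considers far more than these four readings; the sketch only shows where the ambiguity comes from.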

Furthermore, Arabic has complex morpho-syntactic agreement rules and many irregular forms: over half of Arabic plurals are irregular (“broken plurals”). Finally, Arabic has a large number of dialectal variants that are as different from MSA as the Romance languages are from Latin. MSA is the official form of Arabic, but it is no one’s mother tongue. The dialects, the true mother tongues, are primarily spoken, do not have written standards, and have very limited resources.

The following CAMeL Lab projects address these challenges, grouped by subcategory:

Arabic and Arabic Dialect Orthography

Project Name Description
Qatar Arabic Language Bank (QALB) Project CAMeL researchers collaborated with Carnegie Mellon University Qatar on the QALB Project, which manually corrected 2M words of unedited Arabic for spelling and grammar mistakes. The QALB corpus was part of two international shared task competitions.
Conventional Orthography for Dialectal Arabic (CODA) CAMeL researchers, in collaboration with researchers at a number of universities, have developed a Conventional Orthography for Dialectal Arabic, a computational “standard” for writing Arabic dialects that so far covers Egyptian, Levantine, Tunisian, and Gulf Arabic.
Arabizi Transliteration Demo
3arrib Project
CAMeL researchers, in collaboration with researchers at Columbia University and George Washington University, developed a system for automatic transliteration from Arabizi (Romanized Arabic) to Arabic script.

Arabic and Arabic Dialect Morphological Analysis and Disambiguation

Project Name Description
MADAMIRA CAMeL collaborates actively with Columbia University and George Washington University on the development and improvement of MADAMIRA, the state-of-the-art morphological tagger for Standard and Dialectal Arabic.
YAMAMA
YAMAMA (Yet Another Multi-Dialect Arabic Morphological Analyzer; Arabic يمامة ‘Barbary Dove’) is a multi-dialect Arabic morphological analyzer and disambiguator that is five times faster than the state-of-the-art MADAMIRA, with slightly lower quality.
CALIMA CAMeL researchers have been collaborating with Columbia University on a family of morphological analyzers and generators for Standard and Dialectal Arabic.
DALILA CAMeL researchers developed a Chrome Extension tool that supports learning Arabic and Arabic dialects by providing translations and word analysis for words on any web page.
Download DALILA

Arabic Syntactic Analysis

Project Name Description
PalmTreeBank CAMeL researchers are working on the development of a linguistic dependency treebank for a number of less studied genres in Arabic.
CAMeLParser
CAMeL researchers are also working on improving the quality of Arabic syntactic analysis.
Download CAMeLParser
NYUAD Arabic Universal Dependency Treebank CAMeL researchers are currently developing a treebank of texts annotated in the Universal Dependency syntactic representation. The treebank will be available as part of the UD v2.0 release in March 2017.

Arabic Dialect Corpora

Project Name Description
Gumar Corpus
CAMeL researchers have collected a 100M-word corpus of Gulf Arabic, named the GUMAR Corpus. CAMeL researchers are working on the automatic annotation of the corpus and on developing gold data annotations for portions of it. Work on GUMAR will lead to improved tools for Gulf Arabic automatic analysis.
Curras Corpus
CAMeL researchers collaborated with Birzeit University’s Curras Project to create and annotate a corpus of 50K words of Palestinian Arabic.
Multi-Arabic Dialect Applications and Resources (MADAR) CAMeL collaborates actively with Carnegie Mellon University Qatar on the MADAR project (Multi-Arabic Dialect Applications and Resources). It is the largest project of its kind and plans to collect dialectal resources from 25 cities across the Arab World and to develop new data sets and tools for Arabic dialect identification and machine translation.
Other Dialectal Corpora CAMeL researchers, in collaboration with researchers at Columbia University, Universität Leipzig, and Yale University, have developed two annotated corpora and analyzers for Moroccan and Sanaani Yemeni Arabic.

Arabic Sentiment Analysis

Project Name Description
Opinion Mining for Arabic (OMA) CAMeL researchers are actively collaborating with researchers at the American University of Beirut and Qatar University on the development of advanced methods for Arabic Sentiment Analysis.

Arabic Readability and Text Simplification

Project Name Description
Simplification of Arabic Masterpieces for Extensive Reading (SAMER)
Simplification of Arabic Masterpieces for Extensive Reading (SAMER) is a research project addressing the severe dearth of graded readers in Arabic fiction by creating a standard and tools for the simplification of fictional works for school-age learners.