Arabic Natural Language Processing

Arabic poses many challenges for Natural Language Processing (NLP). It is both morphologically rich and highly ambiguous. In Modern Standard Arabic (MSA), a complete part-of-speech tag set has over 300,000 tags (English has about 50), and MSA words have 12 morphological analyses on average (English words have 1.25 POS tags on average). The high ambiguity results primarily from Arabic orthography, which almost always omits the diacritics used to mark short vowels and consonantal doubling.
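To make the orthographic source of ambiguity concrete, here is a minimal Python sketch. It assumes that Arabic diacritics can be identified as Unicode combining marks and uses three well-known diacritized forms (kataba "he wrote", kutub "books", kutiba "it was written"). The function name `strip_diacritics` is hypothetical and the snippet is not taken from any CAMeL tool; it only illustrates how distinct words collapse into one written form.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Drop Unicode combining marks, which include the Arabic short-vowel
    and doubling diacritics (fatha, damma, kasra, shadda, sukun, ...)."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))

# Three different words, written with full diacritics (illustrative examples).
diacritized = {
    "kataba (he wrote)": "\u0643\u064E\u062A\u064E\u0628\u064E",
    "kutub (books)": "\u0643\u064F\u062A\u064F\u0628",
    "kutiba (it was written)": "\u0643\u064F\u062A\u0650\u0628\u064E",
}

# After removing diacritics, all three share the same surface form (k-t-b).
undiacritized = {gloss: strip_diacritics(word) for gloss, word in diacritized.items()}
distinct_surface_forms = set(undiacritized.values())

for gloss, surface in undiacritized.items():
    print(gloss, "->", surface)
print("distinct surface forms:", len(distinct_surface_forms))
```

A morphological disambiguator has to recover which of these readings is intended from context alone, since everyday text usually shows only the undiacritized form.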

Furthermore, Arabic has complex morpho-syntactic agreement rules and many irregular forms: over half of Arabic plurals are irregular (“broken plurals”). Finally, Arabic has a large number of dialectal variants that are as different from MSA as Romance languages are from Latin. MSA is the official form of Arabic, but it is no one’s mother tongue. The dialects, the true mother tongues, are primarily spoken, do not have written standards, and have very limited resources.

There are multiple projects in CAMeL Lab that address these challenges for Arabic:

Arabic and Arabic Dialect Orthography

  • QALB Project
    CAMeL researchers collaborated with Carnegie Mellon University Qatar on the Qatar Arabic Language Bank (QALB) Project, which manually corrected 2M words of unedited Arabic for spelling and grammar mistakes. The QALB corpus was used in two international shared-task competitions.

  • CODA Project
    CAMeL researchers, in collaboration with researchers at a number of universities, have developed a Conventional Orthography for Dialectal Arabic (CODA), a computational “standard” for writing Arabic dialects, so far including Egyptian, Levantine, Tunisian, and Gulf Arabic.

  • 3arrib Project
    CAMeL researchers, in collaboration with researchers at Columbia University and George Washington University, developed a system for automatic transliteration from Arabizi (Romanized Arabic) to Arabic script. A demo of the system can be found here.

Arabic and Arabic Dialect Morphological Analysis and Disambiguation

    CAMeL collaborates actively with Columbia University and George Washington University on the development and improvement of MADAMIRA, the state-of-the-art Arabic morphological tagger for Standard and Dialectal Arabic.

    YAMAMA (Yet Another Multi-Dialect Arabic Morphological Analyzer; Arabic يمامة ‘Barbary Dove’) is a multi-dialect Arabic morphological analyzer and disambiguator that is five times faster than the state-of-the-art MADAMIRA, at slightly lower quality.

    CAMeL researchers have been collaborating with Columbia University on a family of morphological analyzers and generators for Standard and Dialectal Arabic.

    CAMeL researchers developed a Chrome Extension tool that supports learning Arabic and Arabic dialects by providing translations and word analysis for words on any webpage. You can download the tool here.

Arabic Syntactic Analysis

  • PalmTreeBank
    CAMeL researchers are working on the development of a linguistic dependency treebank for a number of less studied genres in Arabic.

  • CAMeLParser
    CAMeL researchers are also working on improving the quality of Arabic syntactic analysis. You can download the parser here.

  • NYUAD Arabic Universal Dependency Treebank
    CAMeL researchers are currently developing a treebank of texts annotated in the Universal Dependency syntactic representation. The treebank will be available as part of the UD v2.0 release in March 2017.

Arabic Dialect Corpora

  • Gumar Corpus
    CAMeL researchers have collected a 100M-word corpus of Gulf Arabic, named the GUMAR Corpus. They are working on the automatic annotation of the corpus and on developing gold data annotations for portions of it. Work on GUMAR will lead to improved tools for the automatic analysis of Gulf Arabic.

  • Curras Corpus
    CAMeL researchers collaborated with Birzeit University’s Curras Project to create and annotate a corpus of 50K words of Palestinian Arabic.

    CAMeL collaborates actively with Carnegie Mellon University Qatar on the MADAR project (Multi-Arabic Dialect Applications and Resources). This is the largest project of its kind: it plans to collect dialectal resources from 25 cities across the Arab World and to develop new data sets and tools for Arabic dialect identification and machine translation.

  • Other Dialectal Corpora
    CAMeL researchers, in collaboration with researchers at Columbia University, Universität Leipzig, and Yale University, have developed two annotated corpora and analyzers for Moroccan and Sanaani Yemeni Arabic.

Arabic Sentiment Analysis

  • CAMeL researchers are actively collaborating with researchers at the American University of Beirut and Qatar University on the development of advanced methods for Arabic Sentiment Analysis. You can find more on the OMA-Project here.

Arabic Readability and Text Simplification

    Simplification of Arabic Masterpieces for Extensive Reading (SAMER) is a research project addressing the severe dearth of graded readers in Arabic fiction by creating a standard and tools for the simplification of fictional works for school-age learners. SAMER is led by a team of specialists in Arabic linguistics, computer science, literature, and K-12 education. The project will build a corpus of all the Arabic texts used in teaching and learning in the official school curricula of the United Arab Emirates; analyze it for lexical, morphological, and syntactic features; and generate a Graded Reader Scale (GRS) mirroring readability levels in the primary, preparatory, and secondary curricula to assist in the simplification process. The project includes a competition to draw in the best talents among students and specialists of Arabic from around the Arab World to simplify some of the major works in Arabic fiction in line with the GRS levels and guidelines.

Machine Translation

Most of the research on machine translation in CAMeL is focused on Arabic as a source or target language. We list below three ongoing efforts.

  • AraParl
    CAMeL researchers, funded by a generous grant from NYUAD, are creating an Arabic translation of a portion of the European Parliamentary Proceedings (EuroParl), thus creating a large-scale development and test set to support research in translation between Arabic and the languages of Europe.

  • Pivot Machine Translation for Morphologically Rich Languages
    Many language pairs do not have the large-scale parallel corpora needed to build statistical machine translation systems. As such, in practice machine translation is done by pivoting through English. Since English is morphologically poor, this reduces translation quality, especially when translating between two morphologically rich languages. In collaboration with Columbia University, we have developed techniques for improving the quality of pivot machine translation and demonstrated them on Hebrew-Arabic and Persian-Arabic.

  • Arabic Dialect Machine Translation
    CAMeL researchers are working in collaboration with researchers at Columbia University on the problem of translating between Arabic dialects and English by exploiting standard Arabic resources.
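
The information loss described under Pivot Machine Translation can be illustrated with a toy Python sketch. It is only an illustration under stated assumptions, not the technique developed in the project: the three lexicons are hypothetical, hand-written, and use Latin transliteration, and the function names are invented for this example. Hebrew and Arabic both mark gender on the second-person singular pronoun, while English "you" does not.

```python
# Hypothetical toy lexicons (Latin transliteration), for illustration only.
HEBREW_TO_ENGLISH = {
    "ata": "you",  # second person singular, masculine
    "at": "you",   # second person singular, feminine
}
ENGLISH_TO_ARABIC = {
    "you": ["anta", "anti"],  # English does not say which gender is meant
}
HEBREW_TO_ARABIC = {
    "ata": ["anta"],  # masculine maps to masculine
    "at": ["anti"],   # feminine maps to feminine
}

def translate_via_pivot(hebrew_word):
    """Translate Hebrew -> English -> Arabic; gender is lost at the pivot."""
    english_word = HEBREW_TO_ENGLISH[hebrew_word]
    return ENGLISH_TO_ARABIC[english_word]

def translate_direct(hebrew_word):
    """Translate Hebrew -> Arabic directly; gender is preserved."""
    return HEBREW_TO_ARABIC[hebrew_word]

for word in ("ata", "at"):
    print(word, "| pivot:", translate_via_pivot(word), "| direct:", translate_direct(word))
```

Through the pivot, both Hebrew pronouns yield the same ambiguous set of Arabic candidates, whereas the direct lexicon keeps them apart. Improved pivoting methods aim to recover this kind of lost morphological information.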

Information Retrieval

Other Research in NLP/CL