Research

CAMeL Lab (Computational Approaches to Modeling Language) is a research lab at New York University Abu Dhabi established in September 2014. CAMeL's mission is research and education in artificial intelligence, specifically focusing on natural language processing, computational linguistics, and data science. The lab's main research areas are Arabic natural language processing, machine translation, sentiment analysis, and dialogue systems.

Computational Modeling of Arabic Orthography

Arabic orthography (or spelling) poses many challenges for computational processing. Standard Arabic orthography marks short vowels and consonant doubling using diacritical marks, which are omitted far more often than not. This results in a high degree of ambiguity. Furthermore, unedited Standard Arabic has many spelling errors (one in four words on average, according to the QALB project). Dialectal Arabic is even more challenging since there is no official standard for spelling words in the Arabic script. As such, native Arabic speakers writing in their dialects spell inconsistently and sometimes even write in romanization. The research at CAMeL Lab addresses these issues and more.
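The ambiguity caused by omitted diacritics can be illustrated with a minimal Python sketch. It is not the lab's tooling: the function names and the set of marks treated as diacritics (U+064B to U+0652 plus U+0670) are assumptions chosen for illustration. The sketch strips diacritics from three distinct fully diacritized words and shows that they collapse to the same undiacritized string.

```python
import re
from collections import defaultdict

# Assumed diacritic inventory for this sketch: fathatan through sukun
# (U+064B-U+0652) plus the dagger alef (U+0670). Not an exhaustive list.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def dediacritize(text: str) -> str:
    """Remove the diacritical marks defined above from the text."""
    return DIACRITICS.sub("", text)


def group_by_bare_form(words):
    """Group diacritized words by the form they share once diacritics are dropped."""
    groups = defaultdict(list)
    for word in words:
        groups[dediacritize(word)].append(word)
    return dict(groups)


# Three different words: 'he wrote', 'books', 'it was written'.
words = ["كَتَبَ", "كُتُب", "كُتِبَ"]
for bare, variants in group_by_bare_form(words).items():
    print(bare, "<-", len(variants), "diacritized forms")
```

A reader who sees only the bare form must recover the intended word from context, which is the disambiguation problem described above.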

Computational Modeling of Arabic Morphology

Arabic morphology is rich, complex, and highly ambiguous. In Modern Standard Arabic (MSA), a complete part-of-speech tag set has over 300,000 tags (whereas English has about 50), and MSA words have 12 morphological analyses on average (English has 1.25 POS tags per word on average). While the high ambiguity is primarily the result of Arabic's ambiguous orthography, Arabic also uses many attached particles that add to the space of possible readings. For example, a word like وحدة can mean ‘unity’, ‘loneliness’, or ‘unit of measure’ if treated as a single base word, but it can also be interpreted as و+حدة ‘and intensity’. The research at CAMeL Lab addresses the problems of morphological analysis (identifying all the possible readings of a word out of context), morphological generation (generating a word given its analysis), morphological disambiguation (identifying the word’s correct reading in context), and morphological annotation (manually identifying the word’s correct reading in context to build a data set for training machine learning models for morphological disambiguation).
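Out-of-context morphological analysis, as described above, can be sketched with a toy lookup. This is an illustration only, not the lab's analyzer: the two-entry lexicon, the single proclitic, and the function name `analyze` are all hypothetical, and real analyzers cover far more features than a gloss. The sketch returns every reading of a word, either as a whole lexicon entry or as an attached particle plus a lexicon entry.

```python
# Hypothetical toy lexicon: base words mapped to English glosses.
LEXICON = {
    "وحدة": ["unity", "loneliness", "unit of measure"],
    "حدة": ["intensity"],
}

# Hypothetical proclitic table: attached particles and their glosses.
PROCLITICS = {"و": "and"}


def analyze(word):
    """Return all (segmentation, gloss) readings of a word out of context."""
    analyses = []

    # Reading 1: the whole word is a single base word.
    for gloss in LEXICON.get(word, []):
        analyses.append((word, gloss))

    # Reading 2: an attached particle followed by a base word.
    for clitic, clitic_gloss in PROCLITICS.items():
        stem = word[len(clitic):]
        if word.startswith(clitic) and stem in LEXICON:
            for gloss in LEXICON[stem]:
                analyses.append((f"{clitic}+{stem}", f"{clitic_gloss} {gloss}"))

    return analyses


for segmentation, gloss in analyze("وحدة"):
    print(segmentation, "=", gloss)
```

Under these assumptions the word yields four readings, three as a base word and one with the conjunction split off. Choosing among them for a given sentence is the separate task of morphological disambiguation.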

Computational Modeling of Arabic Syntax

Researchers at CAMeL Lab have been working on improving models of Arabic syntactic analysis through the development of new treebanks (databases of syntactic analyses) and new systems for syntactic parsing.

Collection, Creation, and Annotation of Arabic Corpora

Data are central to the development of artificial intelligence and natural language processing systems. Data come in many forms: monolingual, bilingual (as in translated parallel texts), multilingual and multi-dialectal, and corpora annotated for a range of possible features. Below is a list of the various projects on data collection, creation, and annotation at CAMeL Lab.

Arabic Text Analytics

Researchers at CAMeL Lab have been working on improving models of Arabic text analytics. Specifically, we work on Arabic readability, Arabic sentiment analysis, and Arabic dialect identification.

Dialogue Systems

Researchers at CAMeL Lab have been working on developing chatbots in English, Arabic, and dialectal Arabic.

Machine Translation

Machine translation is the task of automatically mapping text in one language to text in another. Besides the data collection and creation mentioned above, researchers at CAMeL Lab are investigating how to use statistical and neural techniques to improve machine translation for a number of less-studied language pairs.

Other Projects

In addition to all of the above, there are other smaller projects in CAMeL Lab.