Research
CAMeL Lab (Computational Approaches to Modeling Language) is a research lab at New York University Abu Dhabi established in September 2014. CAMeL's mission is research and education in artificial intelligence, specifically focusing on natural language processing, computational linguistics, and data science. The main lab research areas are Arabic natural language processing, machine translation, sentiment analysis and dialogue systems.
Computational Modeling of Arabic Orthography
Arabic orthography (or spelling) poses many challenges for computational processing. Standard Arabic orthography marks short vowels and consonant doubling using diacritical marks, which are omitted far more often than not. This results in a high degree of ambiguity. Furthermore, unedited Standard Arabic has many spelling errors (one in four words on average, according to the QALB project). Dialectal Arabic is even more challenging since there is no official standard for spelling words in the Arabic script. As such, native Arabic speakers writing in their dialects show a lot of inconsistency and sometimes even write in romanization. The research in CAMeL Lab addresses these issues and more.
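The ambiguity caused by omitted diacritics can be illustrated with a minimal sketch. The helper name `strip_diacritics` and the two example words are chosen here for illustration only and are not taken from any CAMeL Lab tool; the character range covers the Unicode marks from fathatan (U+064B) to sukun (U+0652).

```python
import re

# Arabic short-vowel, nunation, gemination, and sukun marks: U+064B..U+0652.
DIACRITICS = re.compile(r"[\u064B-\u0652]")

def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritical marks, leaving only the base letters."""
    return DIACRITICS.sub("", text)

# Two distinct diacritized words (illustrative examples):
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kataba, 'he wrote'
kutub = "\u0643\u064F\u062A\u064F\u0628"         # kutub, 'books'

# Once the marks are dropped, both words become the same three letters.
print(strip_diacritics(kataba) == strip_diacritics(kutub))  # True
```

Because the undiacritized form is what usually appears in real text, a system reading it has to recover the intended word from context.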
-
CAMeL researchers, in collaboration with researchers at a number of universities, have developed CODA (Conventional Orthography for Dialectal Arabic), a computational “standard” for writing Arabic dialects. Earlier versions of CODA targeted specific Arabic dialects: Egyptian, Palestinian, Tunisian, Algerian, and Gulf. In its most recent iteration, the guidelines for CODA-Star (as in CODA for any dialect) cover 28 city dialects.
-
CAMeL researchers, in collaboration with researchers at Columbia University and George Washington University, developed a system for automatic transliteration from Arabizi (Romanized Arabic) to Arabic script.
-
Qusasat is a crowd-sourcing solution for Arabic text digitization. This application was developed for the 2016 NYUAD Hackathon for Social Good in the Arab World by a team of student participants and CAMeL researchers. It won both the First Place and Audience Choice awards.
-
ArabScribe was a senior capstone project addressing the problem of language learners looking up unfamiliar words that they have only heard. The project created an improved phonetic search algorithm and a data set of 10,000 impressions from native and non-native Arabic speakers of what a word is, based on its audio alone.
Publication
Zhang, Lingliang, Nizar Habash, and Godfried Toussaint. "Robust Dictionary Lookup in Multiple Noisy Orthographies." Proceedings of the Third Arabic Natural Language Processing Workshop. 2017.
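The general idea of looking up a word from a noisy spelling can be sketched with plain edit distance. This is not the ArabScribe algorithm, which is described in the publication above; it is a generic baseline, and the romanized dictionary entries, the query, and the function names are all hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def lookup(query: str, dictionary: dict) -> str:
    """Return the dictionary key closest to a noisy query."""
    return min(dictionary, key=lambda key: levenshtein(query, key))

# Hypothetical romanized entries and a learner's noisy guess at what was heard.
DICTIONARY = {"kitab": "book", "kalb": "dog", "qalb": "heart"}
best = lookup("ketaab", DICTIONARY)
print(best, "->", DICTIONARY[best])  # kitab -> book
```

A phonetic search would typically weight confusable sounds differently rather than charging every edit the same cost, which is one place a uniform baseline like this falls short.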
-
CAMeL researchers collaborated with Carnegie Mellon University Qatar on the QALB Project, which manually corrected 2 million words of unedited Arabic for spelling and grammar mistakes. The QALB corpus was part of two international shared task competitions. Most recently, CAMeL lab researchers have been developing new models for automatic spelling correction using this corpus.
Computational Modeling of Arabic Morphology
Arabic morphology is rich, complex, and highly ambiguous. In Modern Standard Arabic (MSA), a complete part-of-speech tag set has over 300,000 tags (whereas English has about 50), and MSA words have 12 morphological analyses on average (English has 1.25 POS tags per word on average). While the high ambiguity is primarily the result of Arabic's ambiguous orthography, Arabic also uses many attached particles that add to the space of possible readings. For example, a word like وحدة can mean ‘unity’, ‘loneliness’, and ‘unit of measure’, if treated as a single base word, but it can also be interpreted as و+حدة ‘and intensity’. The research in CAMeL Lab addresses the problems of morphological analysis (identifying all the possible readings of a word out of context), morphological generation (generating a word given its analysis), morphological disambiguation (identifying the word’s correct reading in context), and morphological annotation (manually identifying the word’s correct reading in context to build a data set for training machine learning models for morphological disambiguation).
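The out-of-context analysis of وحدة described above can be sketched as a toy lookup. The lexicon, the proclitic table, and the `analyze` function are hypothetical and cover only this one example; real analyzers such as those listed below rely on far larger databases and richer feature sets.

```python
# Toy lexicon (hypothetical): undiacritized base word -> possible glosses.
LEXICON = {
    "وحدة": ["unity", "loneliness", "unit of measure"],
    "حدة": ["intensity"],
}

# Toy proclitic table (hypothetical): attached particle -> gloss.
PROCLITICS = {"و": "and"}

def analyze(word: str) -> list:
    """Return every out-of-context reading as (segmentation, gloss)."""
    # Readings that treat the word as a single base word.
    readings = [(word, gloss) for gloss in LEXICON.get(word, [])]
    # Readings that split off an attached particle.
    for clitic, clitic_gloss in PROCLITICS.items():
        stem = word[len(clitic):]
        if word.startswith(clitic) and stem in LEXICON:
            readings += [(f"{clitic}+{stem}", f"{clitic_gloss} {gloss}")
                         for gloss in LEXICON[stem]]
    return readings

for segmentation, gloss in analyze("وحدة"):
    print(segmentation, gloss)
```

The function returns four readings for this word: three as a base word and one with the conjunction split off. Disambiguation is then the separate task of choosing among them in context.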
-
CAMeL researchers have been collaborating with Columbia University on a family of morphological analyzers and generators for Standard and Dialectal Arabic.
- CALIMA-STAR
A rich Arabic morphological analyzer and generator that provides functional and form-based morphological features as well as built-in tokenization, phonological representation, lexical rationality, reinflection, and much more.
- CALIMA Gulf
The Gulf Arabic morphological analyzer (CALIMAGLF) currently covers over 2,600 verbal lemmas. It models fully inflected paradigms and orthographic varieties.
-
- MADAMIRA
CAMeL collaborates actively with Columbia University and George Washington University on the development and improvement of MADAMIRA, the state-of-the-art Arabic morphological tagger for Standard and Dialectal Arabic.
- YAMAMA
YAMAMA (Yet Another Multi-Dialect Arabic Morphological Analyzer; Arabic for ‘Barbary Dove’) is a multi-dialect Arabic morphological analyzer and disambiguator that is five times faster than the state-of-the-art MADAMIRA, at slightly lower quality.
-
- MADARi: Interface for Morphological Annotation of Arabic and its Dialects
The MADARi annotation interface is a joint morphological annotation and spelling correction system for texts in Standard and Dialectal Arabic. The MADARi framework provides intuitive interfaces for annotating text and managing the annotation process of a large number of sizable documents.
-
CAMeL researchers developed a Chrome extension that supports learning Arabic and Arabic dialects by providing translations and word analysis for words on any web page. This application was developed as part of the NYUAD Hackathon for Social Good in the Arab World (2015).
Computational Modeling of Arabic Syntax
Researchers in CAMeL lab have been working on improving models of Arabic syntactic analysis through the development of new treebanks (databases of syntactic analyses) and new systems for syntactic parsing.
-
The CamelParser is an Arabic syntactic dependency parser. It produces morphologically enriched Columbia Arabic Treebank (CATiB) syntactic representations.
Shahrour, Anas, et al. "Camelparser: A System for Arabic Syntactic Analysis and Morphological Disambiguation." Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations. 2016.
-
CAMeL researchers are currently developing a treebank of texts annotated in the Universal Dependency syntactic representation. The NYUAD Arabic UD treebank (NUDAR) is based on the Penn Arabic Treebank (PATB), parts 1, 2, and 3, through conversion to CATiB dependency trees.
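Universal Dependencies treebanks are distributed in the CoNLL-U format, with one token per line and ten tab-separated columns. The sketch below reads one sentence into (id, form, head, relation) tuples. The sample sentence is constructed here for illustration and is not taken from NUDAR, and the function name is an assumption.

```python
def read_conllu_sentence(block: str) -> list:
    """Parse one CoNLL-U sentence into (id, form, head, deprel) tuples."""
    tokens = []
    for line in block.strip().splitlines():
        if line.startswith("#"):      # sentence-level comment line
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():     # skip multiword ranges (1-2) and empty nodes
            continue
        # Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        tokens.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))
    return tokens

# Illustrative sample: the surface word is listed as a 1-2 range,
# followed by its two syntactic tokens.
SAMPLE = "\n".join([
    "# text = وحدة",
    "\t".join(["1-2", "وحدة", "_", "_", "_", "_", "_", "_", "_", "_"]),
    "\t".join(["1", "و", "و", "CCONJ", "_", "_", "2", "cc", "_", "_"]),
    "\t".join(["2", "حدة", "حدة", "NOUN", "_", "_", "0", "root", "_", "_"]),
])

for token in read_conllu_sentence(SAMPLE):
    print(token)
```

The range line keeps the original surface word, while the numbered lines carry the tokens that take part in the dependency tree. This is how the format accommodates languages with attached particles.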
-
Palmyra is a platform-independent graphical tool for syntactic dependency annotation supporting languages that require complex morphological tokenization.
Dima Taji and Nizar Habash. PALMYRA 2.0: A Configurable Multilingual Platform Independent Tool for Morphology and Syntax Annotation. In Proceedings of Universal Dependencies Workshop (UDW) 2020.
Talha Javed, Nizar Habash, and Dima Taji. 2018. Palmyra: A Platform Independent Dependency Annotation Tool for Morphologically Rich Languages. In Proceedings of LREC 2018.
-
CAMeL researchers collaborated with Carnegie Mellon University and the American University of Science and Technology Lebanon to produce a corpus of manually and synthetically constructed syntactic analyses of questions in Arabic.
Collection, Creation, and Annotation of Arabic Corpora
Data are central to the development of artificial intelligence and natural language processing systems. Data come in many forms: monolingual, bilingual (as in translated parallel texts), multilingual and multi-dialectal, and annotated corpora for a range of possible features. Below is a list of the various projects on data collection, creation, and annotation at CAMeL Lab.
-
CAMeL researchers have collected a 100M-word corpus of Gulf Arabic, named the GUMAR Corpus. CAMeL researchers are working on the automatic annotation of the corpus and on developing gold data annotations for 200,000 words from eight novels. Work on GUMAR will lead to improved tools for Gulf Arabic automatic analysis.
Khalifa, Salam, et al. "A Morphologically Annotated Corpus of Emirati Arabic." LREC. 2018.
-
The MADAR project (Multi-Arabic Dialect Applications and Resources) is a three-year collaboration between CAMeL Lab, the NLP group at Carnegie Mellon University in Qatar, and Columbia University.
The MADAR Corpus is a collection of parallel sentences covering the dialects of 25 cities from the Arab World, in addition to English, French, and Modern Standard Arabic (MSA). This corpus was created by translating selected sentences from the Basic Traveling Expression Corpus (BTEC) in French and English to the different dialects.
The MADAR Corpus is made available to the research community under a non-commercial license.
-
The Camel Treebank (CamelTB) is an open-source dependency treebank of Modern Standard and Classical Arabic.
CamelTB 1.0 includes 13 sub-corpora (188K words) comprising selections of texts from pre-Islamic poetry to social media online commentaries, and covering a range of genres from religious and philosophical texts to news, novels, and student essays.
-
CAMeL researchers collaborated with Birzeit University’s Curras Project to create and annotate a corpus of 50K words of Palestinian Arabic.
-
CAMeL researchers, funded by a generous grant from NYUAD, are creating an Arabic translation of a portion of the European Parliamentary Proceedings (EuroParl), thus making available a large scale development and test set to support research in translation between Arabic and European languages.
-
Researchers in CAMeL lab worked on developing the Japanese-Arabic section of the TUFS Media Corpus, which comprises a parallel corpus of news articles collected at Tokyo University of Foreign Studies (TUFS).
Arabic Text Analytics
Researchers in CAMeL lab have been working on improving models of Arabic text analytics. Specifically, we work on Arabic readability, Arabic sentiment analysis, and also Arabic dialect identification.
-
SAMER: Simplification of Arabic Masterpieces for Extensive Reading
SAMER is a research project addressing the severe dearth of graded readers in Arabic fiction by creating a standard and tools for the simplification of fictional works for school-age learners.
-
Researchers from CAMeL Lab collaborated with researchers from the American University of Beirut and Qatar University on the Opinion Mining for Arabic (OMA) project.
Badaro, Gilbert, Ramy Baly, Hazem Hajj, Nizar Habash, and Wassim El-Hajj. "A large scale Arabic sentiment lexicon for Arabic opinion mining." In Proceedings of the EMNLP 2014 workshop on Arabic natural language processing (ANLP), pp. 165-173. 2014.
Dialogue Systems
Researchers in CAMeL lab have been working on developing chatbots in English, Arabic, and dialectal Arabic.
-
TOIA (time-offset interaction application) is an interactive human avatar dialogue system: a bilingual (Arabic-English) conversational agent, similar to a chatbot, except that the avatar is based on a pre-recording of an actual human being. The system is designed to allow anybody, simply using a laptop, to create an avatar of themselves. As an interactive tool, TOIA can serve as a conversational medium of storytelling, thus enabling cross-cultural and cross-generational sharing and preservation of stories.
Contact Nizar Habash if interested in getting access to the tool.
-
BOTTA is the first Arabic dialect chatbot, exploring the challenges of creating a conversational agent that aims to simulate friendly conversations using the Egyptian Arabic dialect.
Dana Abu Ali and Nizar Habash. 2016. Botta: An Arabic Dialect Chatbot. In Proceedings of COLING 2016, System Demonstrations, Osaka, Japan.
Machine Translation
Machine translation is the task of automatically mapping text in one language to another. Besides data collection and creation mentioned above, researchers at CAMeL Lab are investigating how to improve machine translation for a number of language pairs that are less studied using statistical and neural techniques.
-
Researchers at CAMeL Lab benchmarked translations from Arabic to 22 European languages and vice versa as part of the creation of the Arab-Acquis corpus.
Arabic <> Hungarian, Finnish, Estonian, Lithuanian, Latvian, German, Danish, Dutch, Swedish, Greek, Slovak, Czech, Polish, Bulgarian, Slovene, Romanian, Portuguese, Italian, Spanish, French, Maltese, English.
Habash, Nizar, Nasser Zalmout, Dima Taji, Hieu Hoang, and Maverick Alzate. "A parallel corpus for evaluating machine translation between Arabic and European languages." In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, vol. 2, pp. 235-241. 2017.
-
Arabic > Japanese
Noll, Ella, Mai Oudah, and Nizar Habash. "Simple Automatic Post-editing for Arabic-Japanese Machine Translation." arXiv preprint arXiv:1907.06210 (2019).
-
Arabic > English, French, Spanish, Russian and Chinese
Zalmout, Nasser, and Nizar Habash. "Optimizing tokenization choice for machine translation across multiple target languages." The Prague Bulletin of Mathematical Linguistics 108.1 (2017): 257-269.
-
Erdmann, Alexander, Nizar Habash, Dima Taji, and Houda Bouamor. "Low resourced machine translation via morpho-syntactic modeling: the case of dialectal Arabic." arXiv preprint arXiv:1712.06273 (2017).
-
Kholy, Ahmed El, and Nizar Habash. "Morphological Constraints for Phrase Pivot Statistical Machine Translation." arXiv preprint arXiv:1609.03376 (2016).
-
Qutr is a smart cross-lingual communication application for the travel domain. It is a real-time messaging app that automatically translates conversations while supporting keyword-to-sentence matching.
Other Projects
In addition to all of the above, there are other smaller projects in CAMeL Lab.
-
The goal of this work is to build speech-based search engines for low-resource languages. Of the several challenges in building such engines, this project focuses on two: mitigating the verbosity of spoken queries, and utilizing methods of speech processing that do not require a language model.
-
Aiming towards a semantic representation of cooking recipes, Simplified Ingredient Merging Map in Recipes (SIMMR) proposes an ingredient-instruction dependency tree, and contains an automatic recipe parser and database.