Arabic poses many challenges to Natural Language Processing (NLP). Arabic is both morphologically rich and highly ambiguous. In Modern Standard Arabic (MSA), a complete part-of-speech tag set has over 300,000 tags (whereas English has about 50), and MSA words have 12 morphological analyses on average (English has 1.25 POS tags per word on average). The high ambiguity is primarily the result of Arabic orthography, which almost always omits the diacritics used to specify short vowels and consonantal doubling.
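The effect of omitting diacritics can be illustrated with a minimal sketch. It assumes that removing Unicode combining marks (category `Mn`) is an adequate stand-in for undiacritized spelling; the helper name `strip_diacritics` and the three example words are chosen for illustration and do not come from any CAMeL tool.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks (category Mn), which include Arabic short-vowel signs and shadda."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# Three different diacritized words built on the letters kaf, ta, ba.
KATABA = "\u0643\u064e\u062a\u064e\u0628\u064e"  # 'he wrote'
KUTUB = "\u0643\u064f\u062a\u064f\u0628"          # 'books'
KUTIBA = "\u0643\u064f\u062a\u0650\u0628\u064e"  # 'it was written'

# Without diacritics, all three collapse to the same written form.
undiacritized = {strip_diacritics(word) for word in (KATABA, KUTUB, KUTIBA)}
print(len(undiacritized))
```

A reader (or an analyzer) facing the single undiacritized form has to recover the intended word from context, which is one source of the multiple analyses per word mentioned above.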
Furthermore, Arabic has complex morpho-syntactic agreement rules and many irregular forms: over half of Arabic plurals are irregular (“broken plurals”). Finally, Arabic has a large number of dialectal variants that are as different from MSA as Romance languages are from Latin. MSA is the official form of Arabic, but is no one’s mother tongue. The dialects, the true mother tongues, are primarily spoken, do not have written standards, and have very limited resources.
The CAMeL Lab projects listed below address these challenges for Arabic, grouped by subcategory:
CAMeL researchers, in collaboration with researchers in a number of universities, have developed a Conventional Orthography for Dialectal Arabic — a computational “standard” for writing Arabic dialects, so far including Egyptian, Levantine, Tunisian, and Gulf Arabic.
CAMeL researchers collaborated with Carnegie Mellon University Qatar on the QALB Project, which manually corrected 2 million words of unedited Arabic for spelling and grammar mistakes. The QALB corpus was part of two international shared task competitions.
CAMeL researchers, in collaboration with researchers at Columbia University and George Washington University, developed a system for automatic transliteration from Arabizi (Romanized Arabic) to Arabic script.
Demo
قصاصات (Qusasat), or Arabic Snippets, is an application developed for the 2016 NYUAD Hackathon for Social Good in the Arab World, where it won both the First Place and Audience Choice awards. It is a crowd-sourcing solution for Arabic text digitization.
Visit Project Page
CAMeL collaborates actively with Columbia University and George Washington University on the development and improvement of MADAMIRA, the state-of-the-art morphological tagger for Standard and Dialectal Arabic.
Learn more about Arabic Morphological Analysis
YAMAMA (Yet Another Multi-Dialect Arabic Morphological Analyzer; Arabic يمامة ‘Barbary Dove’) is a multi-dialect Arabic morphological analyzer and disambiguator that is five times faster than the state-of-the-art MADAMIRA, at slightly lower quality.
Khalifa, Salam, Nasser Zalmout, and Nizar Habash. "Yamama: Yet another multi-dialect Arabic morphological analyzer." Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations. 2016.
CAMeL researchers have been collaborating with Columbia University on a family of morphological analyzers and generators for Standard and Dialectal Arabic.
CAMeL researchers developed a Chrome Extension tool that supports learning Arabic and Arabic dialects by providing translations and word analysis for words on any web page.
CAMeL researchers are working on the development of a linguistic dependency treebank for a number of less studied genres in Arabic.
CAMeL researchers are also working on improving the quality of Arabic syntactic analysis.
Shahrour, Anas, et al. "CamelParser: A system for Arabic syntactic analysis and morphological disambiguation." Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations. 2016.
CAMeL researchers are currently developing a treebank of texts annotated in the Universal Dependency syntactic representation.
CAMeL researchers have collected a 100M-word corpus of Gulf Arabic, named the GUMAR Corpus. They are working on the automatic annotation of the corpus and on developing gold data annotations for portions of it. Work on GUMAR will lead to improved tools for the automatic analysis of Gulf Arabic.
Khalifa, Salam, et al. "A Morphologically Annotated Corpus of Emirati Arabic." LREC. 2018.
CAMeL researchers collaborated with Birzeit University’s Curras Project to create and annotate a corpus of 50K words of Palestinian Arabic.
CAMeL collaborates actively with Carnegie Mellon University Qatar on the MADAR project (Multi-Arabic Dialect Applications and Resources). The largest project of its kind, MADAR plans to collect dialectal resources from 25 cities across the Arab World and to develop new data sets and tools for Arabic dialect identification and machine translation.
CAMeL researchers, in collaboration with researchers at Columbia University, Universität Leipzig, and Yale University, have developed two annotated corpora and analyzers for Moroccan and Sanaani Yemeni Arabic.
Simplification of Arabic Masterpieces for Extensive Reading (SAMER) is a research project addressing the severe dearth of graded readers in Arabic fiction by creating a standard and tools for simplifying fictional works for school-age learners.
Al-Khalil, Muhamed, et al. "A Leveled Reading Corpus of Modern Standard Arabic." LREC. 2018.
Visit Project Page
Most of the research on machine translation in CAMeL is focused on Arabic as source or target language. We list below three ongoing efforts.
CAMeL researchers, funded by a generous grant from NYUAD, are creating an Arabic translation of a portion of the European Parliamentary Proceedings (EuroParl), thereby creating a large-scale development and test set to support research in translation between Arabic and the languages of Europe.
Many languages do not have the large-scale parallel corpora needed to build statistical machine translation systems. In practice, machine translation for such languages is therefore done by pivoting through English. Because English is morphologically poor, pivoting reduces translation quality, especially when translating between two morphologically rich languages. In collaboration with Columbia University, we have developed techniques for improving the quality of pivot machine translation and demonstrated them on Hebrew-Arabic and Persian-Arabic.
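The information loss caused by pivoting can be sketched with a toy example. The code below composes two word-level translation tables by summing over pivot words, p(t|s) = Σ_e p(e|s) · p(t|e), which is a common textbook formulation of pivoting and not necessarily the technique developed in this project. The function name `pivot_table`, the romanized words, and all probabilities are made up for illustration.

```python
from collections import defaultdict


def pivot_table(src_to_pivot, pivot_to_tgt):
    """Compose two translation tables: p(t|s) = sum over pivot words e of p(e|s) * p(t|e)."""
    result = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_to_pivot.items():
        for pivot_word, p_pivot in pivots.items():
            for tgt, p_tgt in pivot_to_tgt.get(pivot_word, {}).items():
                result[src][tgt] += p_pivot * p_tgt
    return {src: dict(targets) for src, targets in result.items()}


# Toy tables with invented probabilities over romanized words (assumptions).
# The Hebrew source word is feminine, but English "teacher" carries no gender.
hebrew_to_english = {"mora": {"teacher": 1.0}}
english_to_arabic = {"teacher": {"muallim": 0.6, "muallima": 0.4}}

table = pivot_table(hebrew_to_english, english_to_arabic)
best = max(table["mora"], key=table["mora"].get)
print(best)
```

In this toy setup the feminine source word ends up preferring the masculine Arabic form, because the gender marking was dropped at the English step. Pivot translation techniques try to reduce this kind of loss.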
CAMeL researchers are working in collaboration with researchers at Columbia University on the problem of translating between Arabic dialects and English by exploiting Standard Arabic resources.
CAMeL researchers are actively collaborating with researchers at the American University of Beirut and Qatar University on the development of advanced methods for Arabic Sentiment Analysis.
Visit Project Website
The goal of this work is to build speech-based search engines for low resource languages. There are several challenges in building such engines — this project focuses on two: mitigating the verbosity of spoken queries, and utilizing methods of speech processing that do not require a language model.
Recipe Parser
TOIA (time-offset interaction application) is an interactive human avatar dialogue system: a bilingual (Arabic-English) conversational agent, similar to a chatbot, except that the avatar is based on a pre-recording of an actual human being. The system is designed to allow anybody, simply using a laptop, to create an avatar of themselves. As an interactive tool, TOIA can serve as a conversational medium for storytelling, thus enabling cross-cultural and cross-generational sharing and preservation of stories.
Contact Nizar Habash if interested in getting access to the tool.
Qutr is a smart cross-lingual communication application for the travel domain. Qutr is a real-time messaging app that automatically translates conversations while supporting keyword-to-sentence matching. Qutr relies on querying a database that holds commonly used pre-translated travel-domain phrases and phrase templates in different languages with the use of keywords.
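The keyword-to-sentence idea can be sketched as a lookup over pre-translated phrases. This is a minimal illustration under assumed details: the `Phrase` structure, the two sample phrases, and the overlap-count scoring are hypothetical and do not reflect Qutr's actual database schema or ranking.

```python
from dataclasses import dataclass


@dataclass
class Phrase:
    keywords: frozenset   # keywords that should retrieve this phrase
    translations: dict    # language code -> pre-translated sentence


# Hypothetical travel-domain phrase database.
PHRASES = [
    Phrase(frozenset({"where", "airport"}),
           {"en": "Where is the airport?", "fr": "Où est l'aéroport ?"}),
    Phrase(frozenset({"how", "much", "ticket"}),
           {"en": "How much is a ticket?", "fr": "Combien coûte un billet ?"}),
]


def match(query, target_lang):
    """Return the stored sentence sharing the most keywords with the query, or None."""
    words = set(query.lower().split())
    best = max(PHRASES, key=lambda phrase: len(phrase.keywords & words))
    if not best.keywords & words:
        return None  # no keyword overlap with any stored phrase
    return best.translations[target_lang]


print(match("airport where", "fr"))
```

Because the sentences are translated ahead of time, the lookup returns a vetted translation instead of generating one at message time.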
Khan, Shehroze, et al. "A Cross-lingual Messenger with Keyword Searchable Phrases for the Travel Domain." Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations. 2018.
Contact Nizar Habash if interested in getting access to the tool.
PALMYRA is a platform-independent graphical dependency tree visualization and editing tool specifically designed to support the complexities of syntactic annotation of morphologically rich languages.
Project Website (Demo)
The MultiScript Phonetic Search algorithm addresses the problem of language learners looking up unfamiliar words that they have heard. Our algorithm outperforms Google Translate’s “did you mean” feature, as well as the Yamli smart Arabic keyboard.
BOTTA is the first Arabic dialect chatbot, exploring the challenges of creating a conversational agent that aims to simulate friendly conversations using the Egyptian Arabic dialect.
The BOTTA database files are publicly available for researchers working on Arabic chatbot technologies.
Link to Publication
Automatic Dialect Identification for Arabic.
Contact Nizar Habash if interested in getting access to the tool.
The MADARi annotation interface is a joint morphological annotation and spelling correction system for texts in Standard and Dialectal Arabic. The MADARi framework provides intuitive interfaces for annotating text and managing the annotation process of a large number of sizable documents.
Contact Nizar Habash if interested in getting access to the tool.
Arab-Acquis is a large, publicly available dataset for evaluating machine translation between 22 European languages and Arabic. It consists of over 12,000 sentences from the JRC-Acquis (Acquis Communautaire) corpus translated twice by professional translators, once from English and once from French, totaling over 600,000 words.
Habash, Nizar, et al. "A parallel corpus for evaluating machine translation between Arabic and European languages." Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Vol. 2. 2017.
The Arabic Multidialectal Word Embeddings are a set of pre-trained word embeddings learned from different dialects of Arabic. Individual dialects are mapped into each other's vector spaces to enable inter-dialectal comparison. A dialect-agnostic model, learned from the combined corpora of all dialects, is included as well. Lastly, we include the seed and evaluation dictionaries used to perform and evaluate the mappings.
Erdmann, Alexander, Nasser Zalmout, and Nizar Habash. "Addressing Noise in Multidialectal Word Embeddings." Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Vol. 2. 2018.