Successful Research Year for CAMeL
2019 Lab Achievements
A successful research year for the Computational Approaches to Modeling Language Lab (CAMeL) at NYUAD
Researchers and students in the Computational Approaches to Modeling Language Lab (CAMeL) at New York University Abu Dhabi published 14 papers and released four resources in natural language processing in 2019. Some of the papers were presented at the following conferences: NAACL 2019 (Minneapolis, USA), ACL 2019 (Florence, Italy), MT Summit 2019 (Dublin, Ireland), Interspeech 2019 (Graz, Austria), and EMNLP 2019 (Hong Kong, China). Some of this work was done in collaboration with researchers from other institutions, including American University of Beirut, Carnegie Mellon University Qatar, Columbia University, Element AI, Google, Ohio State University, and the Qatar Computing Research Institute.
The CAMeL Lab's research areas include developing new artificial intelligence algorithms for language processing, creating resources and tools to support research in computational linguistics, and creating new annotation standards and guidelines, with a focus on the Arabic language and its dialects.
Publications by Theme
Computational Morphology
- Adversarial Multitask Learning for Joint Multi-Feature and Multi-Dialect Morphological Modeling by Nasser Zalmout and Nizar Habash. (ACL 2019).
- A Little Linguistics Goes a Long Way: Unsupervised Segmentation with Limited Language Specific Guidance by Alexander Erdmann, Salam Khalifa, Mai Oudah, Houda Bouamor and Nizar Habash. (SIGMORPHON 2019, co-located with ACL).
- Morphologically Annotated Corpora for Seven Arabic Dialects: Taizi, Sanaani, Najdi, Jordanian, Syrian, Iraqi and Moroccan by Faisal Alshargi, Shahd Dibas, Sakhar Alkhereyf, Reem Faraj, Basmah Abdulkareem, Sane Yagi, Ouafaa Kacha, Nizar Habash and Owen Rambow. (WANLP 2019, co-located with ACL).
- Joint Diacritization, Lemmatization, Normalization, and Fine-Grained Morphological Tagging by Nasser Zalmout and Nizar Habash. (arXiv).
Dialect Identification
- The MADAR Shared Task on Arabic Fine-Grained Dialect Identification by Houda Bouamor, Sabit Hassan, and Nizar Habash. (WANLP 2019, co-located with ACL).
- ADIDA: Automatic Dialect Identification for Arabic by Ossama Obeid, Mohammad Salameh, Houda Bouamor, and Nizar Habash. (NAACL 2019).
Information Extraction
- Unsupervised Neologism Normalization Using Embedding Space Mapping by Nasser Zalmout, Kapil Thadani, and Aasish Pappu. (W-NUT 2019, co-located with EMNLP).
- The Effectiveness of Simple Hybrid Systems for Hypernym Discovery by William Held and Nizar Habash. (ACL 2019).
- Practical, Efficient, and Customizable Active Learning for Named Entity Recognition in the Digital Humanities by Alexander Erdmann, David Joseph Wrisley, Benjamin Allen, Christopher Brown, Sophie Cohen-Bodénès, Micha Elsner, Yukun Feng, Brian Joseph, Béatrice Joyeux-Prunel, and Marie-Catherine de Marneffe. (NAACL 2019).
Machine Translation
- The Impact of Preprocessing on Arabic-English Statistical and Neural Machine Translation by Mai Oudah, Amjad Almahairi, and Nizar Habash. (MT Summit 2019).
- Simple Automatic Post-editing for Arabic-Japanese Machine Translation by Ella Noll, Mai Oudah, and Nizar Habash. (arXiv).
Sentiment Analysis
- A Survey of Opinion Mining in Arabic: A Comprehensive System Perspective Covering Challenges and Advances in Tools, Resources, Models, Applications and Visualizations by Gilbert Badaro, Ramy Baly, Hazem Hajj, Wassim El-Hajj, Khaled Shaban, Nizar Habash, Ahmad Sallab, and Ali Hamdi. (TALLIP).
Other
Speech Evaluation
- Towards Variability Resistant Dialectal Speech Evaluation by Ahmed Ali, Salam Khalifa, and Nizar Habash. (Interspeech 2019).
Gender Bias in AI
- Automatic Gender Identification and Reinflection in Arabic by Nizar Habash, Houda Bouamor, and Christine Chung. (GEBNLP 2019, co-located with ACL).
Resources
Tools
- ADIDA Interface - An automatic dialect identification tool for Arabic.
Corpora
- The Margarita Dialogue Corpus - A collection of out-of-context and in-context question-answer pairs for developing time-offset interaction dialogue systems.
- MADAR Parallel Corpus Dataset - A parallel corpus of 25 Arab city dialects created as part of the Multi Arabic Dialect Applications and Resources (MADAR) project.
- MADAR Shared Task 2019: Arabic Fine-Grained Dialect Identification Dataset - Contains data from the MADAR Parallel Corpus as well as additional Twitter data used in the WANLP 2019 dialect identification shared task.