Resources
Corpora
-
Since its creation in 2008, the ANERcorp dataset (Benajiba & Rosso, 2008) has been a standard reference used by Arabic named entity recognition researchers around the world. However, over time, this dataset was copied over from user to user, modified slightly here and there, and split in many different configurations that made it hard to compare fairly across papers and systems.
In 2020, a group of researchers from CAMeL Lab (Habash, Alhafni and Oudah) and Mind Lab (Antoun and Baly) met with the creator of the corpus, Yassine Benajiba, to consult with him, collectively agree on an exact split, and accept minor corrections to the original dataset. Bashar Alhafni from CAMeL Lab, working with Nizar Habash, implemented the decisions provided in this release.
-
ARAB-ACQUIS is a large dataset for evaluating machine translation between 22 European languages and Arabic. ARAB-ACQUIS consists of over 12,000 sentences from the JRC-ACQUIS (Acquis Communautaire) corpus translated twice by professional translators, once from English and once from French, and totaling over 600,000 words.
-
The Arabic Parallel Gender Corpus (APGC) is designed to support research on gender bias and personalization in natural language processing applications working on Arabic. The corpus comes in three versions: v1.0, v2.0, and v2.1.
APGC v1.0 includes only first-person-singular sentences and was presented in the 2019 paper on “Automatic Gender Identification and Reinflection in Arabic” by Habash et al. in the First Workshop on Gender Bias in Natural Language Processing. APGC v1.0 contains over 12,000 sentences annotated for first-person-singular grammatical gender, and over 200,000 synthetic sentences in masculine and feminine form.
APGC v2.0 expands on v1.0 by adding second-person targets as well as increasing the total number of sentences over 6.5 times, reaching over 590K words. APGC v2.0 was introduced in the 2022 paper on “The Arabic Parallel Gender Corpus 2.0: Extensions and Analyses” in the 13th Language Resources and Evaluation Conference (LREC) by Alhafni et al. APGC v2.0 contains over 80,000 sentences annotated for first- and second-person grammatical genders covering singular, dual, and plural constructions.
APGC v2.1 extends the word-level annotations in v2.0 by marking the genders of both the base words and their pronominal enclitics. APGC v2.1 was introduced in the 2022 paper on “User-Centric Gender Rewriting” in the 2022 Conference of the North American Chapter of the Association for Computational Linguistics by Alhafni et al.
Habash, Nizar, Houda Bouamor, Christine Chung. 2019. Automatic Gender Identification and Reinflection in Arabic. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, Florence, Italy.
Alhafni, Bashar, Nizar Habash, Houda Bouamor. 2022. The Arabic Parallel Gender Corpus 2.0: Extensions and Analyses. In Proceedings of the 13th Language Resources and Evaluation Conference (LREC), Marseille, France.
Alhafni, Bashar, Nizar Habash, Houda Bouamor. 2022. User-Centric Gender Rewriting. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics, Seattle, Washington.
-
In the Bahrain Corpus, we aimed to create a specialized corpus of the Bahraini Arabic dialect, which includes written texts as well as transcripts of audio files, belonging to different genres (folktales, comedy shows, plays, cooking shows, etc.).
Abdulrahim, Dana, Go Inoue, Latifa Shamsan, Salam Khalifa, and Nizar Habash. The Bahrain Corpus: A Multi-genre Corpus of Bahraini Arabic. In Proceedings of The 13th Language Resources and Evaluation Conference (LREC2022), pages 2345-2352, Marseille, France. 2022. European Language Resources Association (ELRA).
-
The Camel Treebank (CamelTB) is an open-source dependency treebank of Modern Standard and Classical Arabic.
CamelTB 1.0 includes 13 sub-corpora (188K words) comprising selections of texts from pre-Islamic poetry to social media online commentaries, and covering a range of genres from religious and philosophical texts to news, novels, and student essays.
-
CAMeL researchers have collected a 100M-word corpus of Gulf Arabic, named the GUMAR Corpus. CAMeL researchers are working on the automatic annotation of the corpus and on developing gold data annotations for 200,000 words from eight novels. Work on GUMAR will lead to improved tools for Gulf Arabic automatic analysis.
CAMeL Lab has also made Gumar n-grams available for download. The n-grams go up to order 5, that is, 5-, 4-, 3-, 2-, and 1-grams, each with its frequency count and the number of documents it appears in. The file format is similar to that of Google n-grams.
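As a rough illustration of working with such files, the sketch below parses lines that carry an n-gram, its frequency count, and its document count. The tab-separated column layout, the `NgramRecord` type, and the `parse_ngram_lines` helper are assumptions made for this example, not the documented format of the released files; check the download itself before relying on it.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class NgramRecord(NamedTuple):
    tokens: Tuple[str, ...]  # the n-gram, split on whitespace
    frequency: int           # total number of occurrences
    doc_count: int           # number of documents containing the n-gram


def parse_ngram_lines(lines: Iterable[str]) -> Iterator[NgramRecord]:
    """Parse lines of the ASSUMED form: n-gram <TAB> frequency <TAB> doc count."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        ngram, frequency, doc_count = line.split("\t")
        yield NgramRecord(tuple(ngram.split()), int(frequency), int(doc_count))


# Made-up sample lines with placeholder tokens (not real corpus data).
sample = ["w1 w2 w3\t12\t5\n", "w1\t340\t87\n"]
records = list(parse_ngram_lines(sample))
for record in records:
    print(len(record.tokens), record.tokens, record.frequency, record.doc_count)
```

Keeping the order of each n-gram implicit in the token tuple means the same parser can read the 1-gram through 5-gram files without changes, provided they share one layout.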
Khalifa, Salam, Nasser Zalmout, and Nizar Habash. Morphological Analysis and Disambiguation for Gulf Arabic: The Interplay between Resources and Methods. In The Proceedings of LREC 2020.
Khalifa, Salam, et al. "A Morphologically Annotated Corpus of Emirati Arabic." LREC. 2018.
-
The HelloThere Corpus is a collection of question-answer pairs and user interactions designed for research in conversational AI and time-offset interactive dialogue systems. This dataset captures real-world interactions between users and time-offset interaction avatars, providing valuable insights into user behavior, question patterns, and response effectiveness in time-offset dialogue systems.
-
A spelling correction corpus of dialectal Arabic corrected using the Conventional Orthography for Dialectal Arabic (CODA). The corpus contains 10,000 sentences covering the cities of Beirut, Cairo, Doha, Rabat, and Tunis.
Eryani, Fadhl, Nizar Habash, Houda Bouamor, and Salam Khalifa. "A spelling correction corpus for multiple Arabic dialects." In Proceedings of The 12th Language Resources and Evaluation Conference, pp. 4130-4138. 2020.
-
The MADAR-Turk corpus adds Turkish sentences to the MADAR Corpus (Bouamor et al., 2018), which provided the first set of parallel sentences to include the dialects of 25 Arab cities in addition to English, French, and MSA. The MADAR Corpus was built on the Basic Traveling Expression Corpus (BTEC) (Takezawa et al., 2007) and comprised about 20,000 English tourism-related sentences. BTEC is conversational in nature, has short sentences, and has translations in several languages, making it an attractive resource for building and testing machine translation models.
To create MADAR-Turk, two native Arabic speakers from Syria who are highly fluent in Turkish translated all 2,000 sentences from the Damascus dialect entries because our initial objective was to work on Syrian Arabic to Turkish machine translation.
-
The Margarita Dialogue Corpus is a collection of question-answer pairs defined both outside the context of a conversation and in the context of dialogues between one person and different people. This corpus is part of a methodology developed for creating the knowledge base for time-offset interaction applications and unstructured dialogue systems.
Chierici, Alberto, Nizar Habash, and Margarita Bicec. "The Margarita Dialogue Corpus: A Data Set for Time-Offset Interactions and Unstructured Dialogue Systems." In Proceedings of The 12th Language Resources and Evaluation Conference, pp. 476-484. 2020.
-
The MADAR project (Multi-Arabic Dialect Applications and Resources) is a three-year project that is a collaboration between CAMeL Lab, the NLP group at Carnegie Mellon University in Qatar, and Columbia University.
The MADAR Corpus is a collection of parallel sentences covering the dialects of 25 cities from the Arab World, in addition to English, French, and Modern Standard Arabic (MSA). This corpus was created by translating selected sentences from the Basic Traveling Expression Corpus (BTEC) in French and English to the different dialects.
The MADAR Corpus is made available to the research community under a non-commercial license.
-
The Multidialectal Arabic parallel corpus is a collection of 1,000 sentences in Modern Standard Arabic, Egyptian, Tunisian, Jordanian, Palestinian and Syrian Arabic, in addition to English.
-
CAMeL researchers are currently developing a treebank of texts annotated in the Universal Dependency syntactic representation. The NYUAD Arabic UD treebank (NUDAR) is based on the Penn Arabic Treebank (PATB), parts 1, 2, and 3, through conversion to CATiB dependency trees.
-
A corpus of manually corrected Arabic text for building automatic correction tools.
Version 0.9.1 (03 Dec 2021): We added the QALB 2015 test set references, and the system outputs from QALB shared tasks 2014 and 2015.
-
The Zayed Arabic-English Bilingual Undergraduate Corpus (ZAEBUC) is a new kind of corpus, which focuses on a large set of bilingual writers and comprises samples of their writing in both their languages.
Habash, Nizar, and David Palfreyman. ZAEBUC: An Annotated Arabic-English Bilingual Writer Corpus. In Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), pp. 79-88, Marseille, 2022.
Lexicons
-
A tool and dataset for the automatic transliteration of undiacritized Arabic text following the American Library Association - Library of Congress (ALA-LC) Romanization standard.
Fadhl Eryani & Nizar Habash. Automatic Romanization of Arabic Bibliographic Records. In Proceedings of the Arabic Natural Language Processing Workshop. 2021.
-
The Arabic Multidialectal Word Embeddings are a set of pre-trained word embeddings learned from different dialects of Arabic. Individual dialects are mapped into each other's dialectal vector spaces to enable inter-dialectal comparison. A dialect agnostic model is included as well, learned from the combined corpora of all dialects. Lastly, we include the seed and evaluation dictionaries used to perform and evaluate the mappings.
-
ArabScribe was a senior capstone project that addressed the problem of language learners looking up unfamiliar words that they have heard. The project created an improved phonetic search algorithm, and a dataset of 10,000 impressions from native and non-native Arabic speakers of what a word is, based on its audio alone.
Publication
Zhang, Lingliang, Nizar Habash, and Godfried Toussaint. "Robust Dictionary Lookup in Multiple Noisy Orthographies." Proceedings of the Third Arabic Natural Language Processing Workshop. 2017.
-
Maknuune is a large open lexicon for the Palestinian Arabic dialect. It has over 36K entries from 17K lemmas, and 3.7K roots. All entries include diacritized Arabic orthography, phonological transcription and English glosses. Some entries are enriched with additional information such as broken plurals and templatic feminine forms, associated phrases and collocations, Standard Arabic glosses, and examples or notes on grammar, usage, or location of collected entry.
Dibas, Shahd, Christian Khairallah, Nizar Habash, Omar Fayez Sadi, Tariq Sairafy, Karmel Sarabta, Abrar Ardah. "Maknuune: A Large Open Palestinian Arabic Lexicon." In Proceedings of the 7th Arabic Natural Language Processing Workshop (WANLP 2022).
-
The MADAR project (Multi-Arabic Dialect Applications and Resources) is a three-year project that is a collaboration between CAMeL Lab, the NLP group at Carnegie Mellon University in Qatar, and Columbia University.
The MADAR lexicon is a collection of 1,045 concepts extracted from the MADAR Corpus defined in terms of triplets of words and phrases from English, French and MSA, along with multiple equivalent dialectal forms covering 25 cities from the Arab World. Each dialectal form includes its CODA orthography and CAPHI phonology.
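To make the shape of an entry concrete, here is a minimal sketch of how one concept could be modeled in code. The `Concept` and `DialectalForm` classes, their field names, and the `forms_by_city` helper are hypothetical and do not reflect the actual release format of the lexicon; the values are placeholders rather than real entries.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DialectalForm:
    city: str   # city dialect the form belongs to
    coda: str   # CODA orthography of the form
    caphi: str  # CAPHI phonology of the form


@dataclass
class Concept:
    english: str  # English member of the defining triplet
    french: str   # French member of the defining triplet
    msa: str      # MSA member of the defining triplet
    forms: List[DialectalForm] = field(default_factory=list)

    def forms_by_city(self) -> Dict[str, List[DialectalForm]]:
        """Group the dialectal forms by city; a city may have several forms."""
        grouped: Dict[str, List[DialectalForm]] = {}
        for form in self.forms:
            grouped.setdefault(form.city, []).append(form)
        return grouped


# Placeholder values only; these are not real lexicon entries.
concept = Concept(english="<English>", french="<French>", msa="<MSA>")
concept.forms.append(DialectalForm("Beirut", "<CODA 1>", "<CAPHI 1>"))
concept.forms.append(DialectalForm("Cairo", "<CODA 2>", "<CAPHI 2>"))
concept.forms.append(DialectalForm("Cairo", "<CODA 3>", "<CAPHI 3>"))
grouped = concept.forms_by_city()
```

Storing forms as a flat list and grouping on demand keeps the model faithful to the description above, where one concept can have multiple equivalent forms in the same city.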
-
The SAMER readability lexicon is a large-scale 26,000-lemma leveled readability lexicon for Modern Standard Arabic. The lexicon was manually annotated in triplicate by language professionals from three regions in the Arab world.
Al Khalil, Muhamed, Nizar Habash, Zhengyang Jiang. "A Large-Scale Leveled Readability Lexicon for Standard Arabic." In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 3053-3062, Marseille, 2020.
Tools
-
BOTTA is the first Arabic dialect chatbot, exploring the challenges of creating a conversational agent that aims to simulate friendly conversations using the Egyptian Arabic dialect.
Dana Abu Ali and Nizar Habash. 2016. Botta: An Arabic Dialect Chatbot. In Proceedings of COLING 2016, System Demonstrations, Osaka, Japan.
-
A rich Arabic morphological analyzer and generator that provides functional and form-based morphological features as well as built-in tokenization, phonological representation, lexical rationality, reinflection, and much more.
-
CAMeLBERT is a collection of BERT models pre-trained on Arabic texts with different sizes and variants. We release pre-trained language models for Modern Standard Arabic, dialectal Arabic, and classical Arabic, in addition to a model pre-trained on a mix of the three.
Inoue, Go et al. "The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models." Proceedings of The Sixth Arabic Natural Language Processing Workshop. 2021.
-
CAMeLBERT Morphosyntactic Tagger is the state-of-the-art system for morphosyntactic tagging for Modern Standard Arabic (MSA) and three Arabic dialects (Egyptian, Gulf, and Levantine).
Selected models are available through CAMeL Tools.
Inoue, Go, Salam Khalifa, and Nizar Habash. 2022. Morphosyntactic Tagging with Pre-trained Language Models for Arabic and its Dialects. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1708–1719, Dublin, Ireland. Association for Computational Linguistics.
-
CamelParser 2.0 is an open-source Python-based Arabic dependency parser targeting two popular Arabic dependency formalisms, the Columbia Arabic Treebank (CATiB), and Universal Dependencies (UD).
The CamelParser pipeline handles the processing of raw text and produces tokenization, part-of-speech and rich morphological features. For disambiguation, users can choose between the BERT unfactored disambiguator, or a lighter Maximum Likelihood Estimation (MLE) disambiguator, both of which are included in CAMeL Tools. For dependency parsing, we use the SuPar Biaffine Dependency Parser.
For the previous version, CamelParser 1.0, please visit this page. Note that this older version is no longer the state-of-the-art and has been archived.
-
An open-source Python toolkit for Arabic natural language processing. CAMeL Tools currently provides utilities for pre-processing, morphological modeling, dialect identification, named entity recognition, and sentiment analysis.
Obeid, Ossama, et al. "CAMeL Tools: An Open Source Python Toolkit for Arabic Natural Language Processing." Proceedings of The 12th Language Resources and Evaluation Conference. 2020.
-
CAMeL researchers developed a Chrome Extension tool that supports learning Arabic and Arabic dialects by providing translations and word analysis for words on any web page.
-
The Gulf Arabic morphological analyzer (CALIMAGLF) currently covers over 2,600 verbal lemmas. It models fully inflected paradigms and orthographic varieties.
-
CAMeL collaborates actively with Columbia University and George Washington University on the development and improvement of MADAMIRA, the state-of-the-art Arabic morphological tagger for Standard and Dialectal Arabic.
-
The MADAR Annotation Interface (MADARi) is a web-based framework for joint morphological annotation and spelling adjustment.
To obtain access to MADARi, please contact Dr. Nizar Habash.
-
Palmyra is a platform-independent graphical tool for syntactic dependency annotation supporting languages that require complex morphological tokenization.
Dima Taji and Nizar Habash. PALMYRA 2.0: A Configurable Multilingual Platform Independent Tool for Morphology and Syntax Annotation. In Proceedings of Universal Dependencies Workshop (UDW) 2020.
Talha Javed, Nizar Habash, and Dima Taji. 2018. Palmyra: A Platform Independent Dependency Annotation Tool for Morphologically Rich Languages. In Proceedings of LREC 2018.
-
Aiming towards a semantic representation of cooking recipes, Simplified Ingredient Merging Map in Recipes (SIMMR) proposes an ingredient-instruction dependency tree, and contains an automatic recipe parser and database.
-
TOIA is a cloud-based user-centered time-offset interaction application.
Chierici, Alberto et al., "A Cloud-based User-Centered Time-Offset Interaction Application." Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue. 2021.
-
YAMAMA (Yet Another Multi-Dialect Arabic Morphological Analyzer; Arabic يمامة ‘Barbary Dove’) is a multi-dialect Arabic morphological analyzer and disambiguator that is five times faster than the state-of-the-art MADAMIRA, at slightly lower quality.
To obtain YAMAMA, please contact Dr. Nizar Habash.
Guidelines
-
The CAMeL Arabic Phonetic Inventory (CAPHI) is a system for representing, i.e. transcribing, the production of Arabic utterances in any dialect, from Modern Standard Arabic (MSA) to the regional colloquial varieties. CAPHI represents every significant sound in all Arabic dialects with a unique letter, meaning that it can be used to represent different pronunciations of words that would otherwise be spelled in the same way according to MSA, CODA*, Arabizi, or other Arabic spelling standards.
-
CAMeL researchers, in collaboration with researchers at a number of universities, have developed CODA (Conventional Orthography for Dialectal Arabic), a computational “standard” for writing Arabic dialects. Earlier versions of CODA targeted specific Arabic dialects: Egyptian, Palestinian, Tunisian, Algerian, and Gulf. In its most recent iteration, the guidelines for CODA-Star (the star standing for any dialect) cover 28 city dialects.
-
CAMEL POS is inspired by the ARZATB tagset and guidelines (Maamouri et al., 2012), which are based on the PATB guidelines (Maamouri et al., 2009). CAMEL POS is designed as a single tagset for both MSA and the dialects with the following goals in mind: (a) facilitating research on adaptation between MSA and the dialects, and among the dialects; (b) supporting backward compatibility with previously annotated resources; and (c) enforcing a functional morphology analysis that is deeper and more compatible with Arabic morphosyntactic rules than form-based analysis (Alkuhlani and Habash, 2011). The CAMEL POS tags and features are the union of those in MSA and the dialects. Features are available to use when needed.
-
The Columbia Arabic Treebank (CATiB) formalism is a syntactic dependency representation used for the Arabic language.
CATiB contrasts with previous approaches to Arabic treebanking in its emphasis on speed with some constraints on linguistic richness. Two basic ideas inspire the CATiB approach: minimizing annotation of redundant information and using representations and terminology inspired by traditional Arabic syntax.