The Arabic Parallel Gender Corpus (APGC) is designed to support research on gender bias and personalization in natural language processing applications for Arabic. The corpus comes in three versions: v1.0, v2.0, and v2.1.
APGC v1.0 includes only first-person-singular sentences and was presented in the 2019 paper “Automatic Gender Identification and Reinflection in Arabic” by Habash et al. at the First Workshop on Gender Bias in Natural Language Processing. APGC v1.0 contains over 12,000 sentences annotated for first-person-singular grammatical gender, and over 200,000 synthetic sentences in masculine and feminine forms.
APGC v2.0 expands on v1.0 by adding second-person targets and by increasing the total number of sentences more than 6.5 times, reaching over 590K words. APGC v2.0 was introduced in the 2022 paper “The Arabic Parallel Gender Corpus 2.0: Extensions and Analyses” by Alhafni et al. at the 13th Language Resources and Evaluation Conference (LREC). APGC v2.0 contains over 80,000 sentences annotated for first- and second-person grammatical gender, covering singular, dual, and plural constructions.
APGC v2.1 extends the word-level annotations in v2.0 by marking the genders of both the base words and their pronominal enclitics. APGC v2.1 was introduced in the 2022 paper “User-Centric Gender Rewriting” by Alhafni et al. at the 2022 Conference of the North American Chapter of the Association for Computational Linguistics.
Habash, Nizar, Houda Bouamor, Christine Chung. 2019. Automatic Gender Identification and Reinflection in Arabic. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, Florence, Italy.
Alhafni, Bashar, Nizar Habash, Houda Bouamor. 2022. The Arabic Parallel Gender Corpus 2.0: Extensions and Analyses. In Proceedings of the 13th Language Resources and Evaluation Conference (LREC), Marseille, France.
Alhafni, Bashar, Nizar Habash, Houda Bouamor. 2022. User-Centric Gender Rewriting. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics, Seattle, Washington.