Gumar Corpus

Gumar is a morphologically annotated Gulf Arabic (GA) corpus. In its current state, it contains more than 112 million words spanning over 1,200 documents. These documents are long conversational novels published anonymously online, known as روايات النت ('Internet novels'). This genre of novels is unique to GA online content.

Gulf Arabic

Strictly speaking, Gulf Arabic refers to the linguistic varieties spoken on the western coast of the Arabian Gulf, in Bahrain, Qatar, and the seven Emirates of the UAE (Qafisheh, 1977), as well as in Kuwait and in Al-Hasā, the eastern region of Saudi Arabia (Holes, 1990). Omani, Hijazi, Najdi, and Baḥārna Arabic, among other dialects spoken in the Arabian Peninsula, are usually not included in grammars of Gulf Arabic because their linguistic features differ considerably from those of the dialects listed above. In this project, we extend the term 'Gulf Arabic' to include any Arabic variety spoken by the indigenous populations of the six countries of the Gulf Cooperation Council: Bahrain, Kuwait, Oman, UAE, Qatar, and Saudi Arabia.

Project Description

Corpus Collection

One genre of written material that is specific to GA is the long conversational novel, published publicly and anonymously online. Such novels are usually written as lengthy threads in online forums. We found a large collection of these novels gathered in one place online and automatically downloaded about 1,200 MS Word documents. The data we obtained had been compiled into MS Word documents by volunteer forum members and then published by another member in an organized manner.

Corpus Genre

The main theme of most of the novels is romance, though they also include drama and sometimes tragedy. The structure of a novel is simple. It starts with a brief introduction that contains the title of the novel, the writer's pen name (no real names are used), and the country of the novel. The introduction is followed by a prologue that usually contains a short piece of dialectal poetry or a short piece of literary writing, usually in MSA. The prologue also contains a brief description of the novel's characters, though some writers prefer to introduce the characters as they appear. Then comes the main body of the novel, which is mostly dialogue between the characters, with some passages of narration between conversations in either the dialect or MSA. The last part of the novel usually has some "moral" lessons narrated by the writer. Writers also tend to ask the audience for constructive criticism and opinions, and whether they should continue writing more novels.

The target audience is mainly teenage girls. The publishing process is highly interactive and depends on the activity of the audience.

Researchers

Nizar Habash (PI)
Salam Khalifa
Dana Abdulrahim (University of Bahrain)

Publications

Salam Khalifa, Nizar Habash, Fadhl Eryani, Ossama Obeid, Dana Abdulrahim, and Meera Al Kaabi: A Morphologically Annotated Corpus of Emirati Arabic. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 2018.
Salam Khalifa, Sara Hassan, and Nizar Habash: A Morphological Analyzer for Gulf Arabic Verbs. In Proceedings of WANLP 2017 (co-located with EACL 2017), Valencia, Spain, 2017.
Salam Khalifa, Nizar Habash, Dana Abdulrahim, and Sara Hassan: Gumar: A Large Scale Corpus of Gulf Arabic. In Proceedings of the Language Resources and Evaluation Conference (LREC), Portorož, Slovenia, 2016.

Acknowledgments

We wish to thank all the writers of the novels for sharing them publicly, even though all are written under pen names. We would also like to thank the Graaam forum members who collected the scattered novels, compiled them into MS Word files, and published them online.

We also thank the Curras members for sharing their web interface code, which we built on to produce this website.