Automatic processing of Arabic is challenging for several reasons. First, Arabic words are morphologically rich. Second, undiacritized Arabic words are highly ambiguous. This is why morphology, and specifically diacritization, is vital for applications of Arabic Natural Language Processing. Below we describe some of the CAMeL lab projects related to Arabic morphology.
CAMeL collaborates actively with Columbia University and George Washington University on the development and improvement of MADAMIRA, a state-of-the-art Arabic morphological tagger. Arabic processing with MADAMIRA includes automatic diacritization, lemmatization, morphological analysis and disambiguation, part-of-speech tagging, stemming, glossing, (configurable) tokenization, base-phrase chunking, and named-entity recognition. These tasks are often intermediate steps in solving a larger, more complex natural language processing problem, such as machine translation. See the MADAMIRA demo and download information.
This project employs a novel technique for Arabic morphological annotation that uses diacritization to produce morphological annotations of a quality comparable to that of human annotators. Although Arabic text is generally written without diacritics, diacritization is already available for large corpora of Arabic text in several genres. Furthermore, diacritization can be generated at low cost for new texts, as it does not require specialized training beyond what educated Arabic typists already know. We plan to use this technique to automatically generate large annotated Arabic corpora in less-studied genres. The corpora will be diacritized by volunteers and paid typists (potentially using crowdsourcing), and the generated annotations can then be used to enhance Arabic tools.
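The core idea, that a diacritized form narrows down which morphological analysis is meant, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Analysis` record, the `select_by_diacritization` helper, the romanized toy forms, and the exact-match criterion are hypothetical and are not taken from the project's actual method or from any tool's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    """One candidate morphological analysis (hypothetical record layout)."""
    diac: str   # diacritized form, romanized here for readability
    lemma: str
    pos: str
    gloss: str


def select_by_diacritization(candidates, typed_diac):
    """Keep only the analyses whose diacritized form equals the typist's form."""
    return [a for a in candidates if a.diac == typed_diac]


# Toy candidate analyses for the undiacritized word "ktb" (illustrative only).
CANDIDATES = [
    Analysis("kataba", "katab", "verb", "he wrote"),
    Analysis("kutiba", "katab", "verb", "it was written"),
    Analysis("kutub", "kitAb", "noun", "books"),
]

# A typist who supplies the form "kutub" implicitly picks the noun reading.
matches = select_by_diacritization(CANDIDATES, "kutub")
for analysis in matches:
    print(analysis.lemma, analysis.pos, analysis.gloss)
```

In this toy setting the typed form leaves a single analysis, so the lemma, part of speech, and gloss come for free. A real system would also have to handle partial or inconsistent diacritization and forms that match several analyses.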
Modern Standard Arabic (MSA) orthography generally omits diacritical marks, which encode lexical as well as syntactic information. Automatic Arabic diacritization is the task of restoring these missing diacritics. Improving diacritization has important implications for downstream Arabic Natural Language Processing applications, e.g., speech recognition, speech synthesis, and machine translation.
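To make the ambiguity concrete, the following Python sketch removes diacritics from three differently diacritized words and shows that they collapse to the same written form, which is the information a diacritizer has to restore. The helper name `strip_diacritics` is hypothetical, and the character range used (U+064B to U+0652 plus the dagger alif U+0670) is an assumption covering the common marks rather than an exhaustive inventory.

```python
import re

# Common Arabic diacritics: tanween, short vowels, shadda, sukun
# (U+064B-U+0652) and the dagger alif (U+0670). Assumed, not exhaustive.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return the text with the diacritics listed above removed."""
    return DIACRITICS.sub("", text)


# Three distinct words that share one undiacritized spelling.
WORDS = {
    "kataba (he wrote)": "\u0643\u064E\u062A\u064E\u0628\u064E",
    "kutub (books)": "\u0643\u064F\u062A\u064F\u0628",
    "kutiba (it was written)": "\u0643\u064F\u062A\u0650\u0628\u064E",
}

stripped = {label: strip_diacritics(word) for label, word in WORDS.items()}
for label, bare in stripped.items():
    print(label, "->", bare)
```

All three entries map to the same three-letter string, so the written form alone does not say which word, or which grammatical reading, was intended.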
Previous efforts used morphological tagging to disambiguate word forms, which worked relatively well on lexical diacritics but not as well on syntactic case diacritics. This suggested that syntactic analysis may help with automatic diacritization. In this project, CAMeL researchers developed an approach for improving the quality of automatic Arabic diacritization through the use of automatic syntactic analysis. This approach combines handwritten rules for case assignment and agreement with machine learning of case and state adjustment on the output of a state-of-the-art morphological tagger.
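As a rough illustration of how syntactic structure can correct a tagger's case choice, here is a minimal Python sketch of rule-based case assignment. It is not the project's rule set: the `Token` layout, the relation labels (`SBJ`, `OBJ`, `IDF`, `MOD`), the case codes, and the four textbook rules are all assumptions chosen for the example, real Arabic grammar has many exceptions to them, and the machine-learning component described above is not represented at all.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    form: str            # romanized word form (illustrative)
    pos: str             # "VERB", "NOUN" or "PREP"
    head: Optional[int]  # index of the syntactic head, None for the root
    rel: str             # hypothetical dependency relation label
    case: str            # tagger's case: "n", "a", "g", or "na" (not applicable)


# Simplified textbook rules keyed by (head POS, relation) -> case.
CASE_RULES = {
    ("VERB", "SBJ"): "n",  # subject of a verb: nominative
    ("VERB", "OBJ"): "a",  # object of a verb: accusative
    ("PREP", "OBJ"): "g",  # object of a preposition: genitive
    ("NOUN", "IDF"): "g",  # second term of an idafa construct: genitive
}


def apply_case_rules(tokens: List[Token]) -> List[str]:
    """Override the tagger's case on nouns wherever a rule applies."""
    cases = []
    for tok in tokens:
        case = tok.case
        if tok.pos == "NOUN" and tok.head is not None:
            head_pos = tokens[tok.head].pos
            case = CASE_RULES.get((head_pos, tok.rel), tok.case)
        cases.append(case)
    return cases


# "The boy wrote the letter in the house", with two deliberately wrong tagger cases.
SENTENCE = [
    Token("ktb", "VERB", None, "ROOT", "na"),
    Token("Alwld", "NOUN", 0, "SBJ", "n"),
    Token("AlrsAlp", "NOUN", 0, "OBJ", "n"),  # tagger error: should be accusative
    Token("fy", "PREP", 0, "MOD", "na"),
    Token("Albyt", "NOUN", 3, "OBJ", "n"),    # tagger error: should be genitive
]

print(apply_case_rules(SENTENCE))  # ['na', 'n', 'a', 'na', 'g']
```

The rules only fire on nouns with a known head, and any noun not covered by a rule keeps the tagger's original case, which mirrors the idea of adjusting a tagger's output rather than replacing it.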
Habash, Nizar, Anas Shahrour and Muhamed Al-Khalil (2016). Exploiting Arabic Diacritization for High-Quality Automatic Annotation. In Proceedings of the Language Resources and Evaluation Conference (LREC), Portorož, Slovenia.
Shahrour, Anas, Salam Khalifa and Nizar Habash (2015). Improving Arabic Diacritization through Syntactic Analysis. In Proceedings of EMNLP, Lisbon.