Research Areas

Machine Translation

With a specific focus on translation for low-resource languages, morphologically rich languages, and hybrid approaches to machine translation.

Information Retrieval

With a specific focus on speech retrieval.

Spoken Document Retrieval

Building speech-based search engines for low-resource languages.

Research Projects

Gumar Corpus

Gumar is a morphologically annotated Gulf Arabic (GA) corpus. In its current state, it contains more than 112 million words spanning over 1,200 documents.

Morphological Analysis of Arabic

Automatic processing of Arabic is challenging for a number of reasons. First, Arabic words are morphologically rich. Second, undiacritized Arabic words are highly ambiguous. This is why morphology, and specifically diacritization, is vital for applications of Arabic Natural Language Processing.


Qusasat, or Arabic Snippets, is an application developed for the 2016 NYUAD Hackathon for Social Good in the Arab World. The application won both the First Place and Audience Choice awards.


Cooking recipes exist in abundance, but because of their unstructured text format, they are hard to study quantitatively beyond treating them as simple bags of words. In this paper, we proposed an ingredient-instruction dependency tree data structure to represent recipes.
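
As a rough illustration of what such a tree could look like, the sketch below models ingredients as leaf nodes and instructions as internal nodes that depend on ingredients or on the results of earlier instructions. The class names, the helper function, and the sample recipe are all hypothetical and chosen only for this example; they are not taken from the paper and should not be read as its actual representation.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Ingredient:
    """Leaf node: a raw ingredient (hypothetical class for this sketch)."""
    name: str


@dataclass
class Instruction:
    """Internal node: a step that depends on ingredients or earlier steps."""
    text: str
    inputs: List[Union["Ingredient", "Instruction"]] = field(default_factory=list)


def ingredients_used(node):
    """Collect the ingredient names reachable from a node, left to right."""
    if isinstance(node, Ingredient):
        return [node.name]
    names = []
    for child in node.inputs:
        names.extend(ingredients_used(child))
    return names


# Hypothetical recipe, invented for illustration only.
whisk = Instruction("Whisk the eggs with the milk.",
                    [Ingredient("eggs"), Ingredient("milk")])
fry = Instruction("Fry the mixture in butter.",
                  [whisk, Ingredient("butter")])

print(ingredients_used(fry))  # ['eggs', 'milk', 'butter']
```

Unlike a bag of words, a structure of this kind records which ingredients feed into which step, so questions such as "which ingredients does the final step depend on?" can be answered by walking the tree.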

Spoken Document Retrieval

The goal of this work is to build speech-based search engines for low-resource languages. This project focuses on two of the challenges in building such engines: mitigating the verbosity of spoken queries, and using speech processing methods that do not require a language model.