Since its creation in 2008, the ANERcorp dataset (Benajiba & Rosso, 2008) has been a standard reference for Arabic named entity recognition researchers around the world. Over time, however, the dataset was copied from user to user, modified slightly along the way, and split in many different configurations, which made it hard to compare results fairly across papers and systems.
In 2020, a group of researchers from CAMeL Lab (Habash, Alhafni and Oudah) and Mind Lab (Antoun and Baly) met with the creator of the corpus, Yassine Benajiba, to consult with him, collectively agree on an exact split, and accept minor corrections to the original dataset. Bashar Alhafni of CAMeL Lab, working with Nizar Habash, implemented the decisions provided in this release.