Fine-grained Arabic Named Entity Corpora


The gold-standard and automatically-developed fine-grained Arabic named entity corpora are resources created by annotating Named Entities into 50 fine-grained classes.

The annotation uses a two-level taxonomy in which each entity is annotated with both a coarse-grained class and a fine-grained class.
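As a rough illustration of the two-level scheme, the sketch below splits a combined tag into its coarse- and fine-grained parts. The tag format (`<coarse>_<fine>`), the example tag `PER_Politician`, and the names `EntityLabel` and `parse_label` are assumptions made for this example only; check the corpus files for the actual label encoding.

```python
from typing import NamedTuple


class EntityLabel(NamedTuple):
    """A two-level label: one coarse-grained class, one fine-grained class."""
    coarse: str
    fine: str


def parse_label(tag: str, sep: str = "_") -> EntityLabel:
    """Split a combined tag at the first separator.

    Assumes a hypothetical '<coarse><sep><fine>' encoding; the real
    corpora may encode the two levels differently.
    """
    coarse, _, fine = tag.partition(sep)
    if not coarse or not fine:
        raise ValueError(f"expected '<coarse>{sep}<fine>', got {tag!r}")
    return EntityLabel(coarse, fine)


# Hypothetical tag, used only to show the shape of the result.
label = parse_label("PER_Politician")
print(label.coarse, label.fine)
```

Keeping the two levels as separate fields makes it easy to evaluate a tagger at either granularity without re-reading the data.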

Manually annotated gold-standard corpora:

1) WikiFANE_Gold: Gold standard Wikipedia-based Fine-grained Arabic Named Entity Corpus, ~500K tokens.

2) NewsFANE_Gold: Gold standard Newswire-based Fine-grained Arabic Named Entity Corpus, ~170K tokens.

These corpora were manually annotated from Arabic Wikipedia and newswire sources, respectively.

To use the gold-standard corpora, please cite the COLING 2014 paper (Alotaibi and Lee, 2014).


Automatically developed corpora:

1) WikiFANE_Whole: All sentences of the Arabic Wikipedia articles were retrieved to compile the corpus. ~2M tokens.

2) WikiFANE_Selective: Only sentences that contain at least one NE phrase were retrieved to compile the corpus. ~2M tokens.

To use the automatically-developed corpora, please cite the IJCNLP 2013 paper (Alotaibi and Lee, 2013).
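The difference between the Whole and Selective variants can be pictured as a simple sentence filter: keep a sentence only if at least one of its tokens carries an entity tag. The sketch below is not the authors' pipeline. It assumes sentences are lists of (token, tag) pairs and that non-entity tokens are tagged `O`; the function names and the sample data are hypothetical.

```python
from typing import List, Tuple

# One sentence = a list of (token, tag) pairs (assumed representation).
Sentence = List[Tuple[str, str]]


def has_entity(sentence: Sentence, outside_tag: str = "O") -> bool:
    """Return True if any token carries a tag other than the outside tag."""
    return any(tag != outside_tag for _, tag in sentence)


def select_sentences(sentences: List[Sentence],
                     outside_tag: str = "O") -> List[Sentence]:
    """Keep only the sentences that contain at least one tagged token."""
    return [s for s in sentences if has_entity(s, outside_tag)]


# Placeholder tokens and a hypothetical tag, for illustration only.
whole = [
    [("w1", "O"), ("w2", "PER_Politician"), ("w3", "O")],
    [("w4", "O"), ("w5", "O")],
]
selective = select_sentences(whole)
print(len(whole), len(selective))
```

Under these assumptions the first sentence is kept and the second is dropped, which mirrors the "at least one NE phrase" criterion described above.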


F. Alotaibi and M. Lee, "Automatically Developing a Fine-grained Arabic Named Entity Corpus and Gazetteer by utilizing Wikipedia", In Proceedings of IJCNLP 2013, pp. 392-400, Nagoya, Japan, October 2013. (acceptance rate: 23.4%)

F. Alotaibi and M. Lee, "A Hybrid Approach to Features Representation for Fine-grained Arabic Named Entity Recognition", In Proceedings of COLING 2014, Dublin, Ireland, August 23-29, 2014.

