Fine-grained Arabic Named Entity Corpora
About
The gold-standard and automatically developed fine-grained Arabic named entity corpora are resources in which named entities are annotated with 50 fine-grained classes.
The annotation uses a two-level taxonomy: each entity is assigned both a coarse-grained class and a fine-grained class.
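As a rough illustration of what a two-level taxonomy implies for downstream code, the sketch below maps each fine-grained class to exactly one coarse-grained class. The label names (`PER`, `Politician`, and so on) and the helpers `label_for` and `coarsen` are hypothetical and are not taken from the corpora; consult the downloaded files for the actual class inventory.

```python
from typing import NamedTuple

# Hypothetical excerpt of a two-level taxonomy: fine-grained -> coarse-grained.
# These label names are illustrative only and are NOT the corpora's real classes.
FINE_TO_COARSE = {
    "Politician": "PER",
    "Athlete": "PER",
    "Government": "ORG",
    "Nation": "GPE",
}


class EntityLabel(NamedTuple):
    coarse: str
    fine: str


def label_for(fine: str) -> EntityLabel:
    """Return the (coarse, fine) pair for a fine-grained class."""
    return EntityLabel(coarse=FINE_TO_COARSE[fine], fine=fine)


def coarsen(fine_labels):
    """Map fine-grained labels to coarse ones; the outside tag "O" is kept."""
    return [lab if lab == "O" else FINE_TO_COARSE[lab] for lab in fine_labels]
```

Under these assumptions, a model trained on fine-grained labels can still be scored at the coarse level by passing its predictions through `coarsen`.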
Manually annotated gold-standard corpora:
1) WikiFANE_Gold: Gold standard Wikipedia-based Fine-grained Arabic Named Entity Corpus, ~500K tokens.
2) NewsFANE_Gold: Gold standard Newswire-based Fine-grained Arabic Named Entity Corpus, ~170K tokens.
These corpora were manually annotated from Arabic Wikipedia and newswire sources, respectively.
DOWNLOAD
Mirror download via sourceforge.net
To use the gold-standard corpora, please cite the COLING 2014 paper (Alotaibi and Lee, 2014).
Automatically developed corpora:
1) WikiFANE_Whole: All sentences of the Arabic Wikipedia articles were retrieved to compile the corpus. ~2M tokens.
2) WikiFANE_Selective: Only sentences containing at least one NE phrase were retrieved to compile the corpus. ~2M tokens.
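The difference between the two variants can be sketched as a simple sentence filter. This is a minimal illustration, not the authors' pipeline: it assumes each sentence is a list of (token, tag) pairs and that non-entity tokens carry the tag "O". Both the data layout and the helper names `has_entity` and `select_sentences` are assumptions made for this example.

```python
def has_entity(sentence, outside_tag="O"):
    """True if at least one token in the sentence carries an entity tag."""
    return any(tag != outside_tag for _token, tag in sentence)


def select_sentences(sentences, outside_tag="O"):
    """Keep only sentences with at least one NE token (the 'Selective' idea)."""
    return [s for s in sentences if has_entity(s, outside_tag)]


# Toy data with placeholder tokens and a hypothetical tag name.
whole = [
    [("tok1", "O"), ("tok2", "O")],          # no entity: dropped
    [("tok3", "Nation"), ("tok4", "O")],     # one entity: kept
]
selective = select_sentences(whole)
```

Here `whole` plays the role of the "Whole" variant (every sentence retained) and `selective` the "Selective" variant.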
DOWNLOAD
Mirror download via sourceforge.net
To use the automatically developed corpora, please cite the IJCNLP 2013 paper (Alotaibi and Lee, 2013).
References
F. Alotaibi and M. Lee, "Automatically Developing a Fine-grained Arabic Named Entity Corpus and Gazetteer by utilizing Wikipedia", In Proceedings of IJCNLP 2013, pp. 392-400, Nagoya, Japan, October 2013. (acceptance rate: 23.4%)
F. Alotaibi and M. Lee, "A Hybrid Approach to Features Representation for Fine-grained Arabic Named Entity Recognition", In Proceedings of COLING 2014, Dublin, Ireland, August 23-29, 2014.
Contact the author
http://www.cs.bham.ac.uk/~fsa081
http://fsalotaibi.kau.edu.sa
For correspondence, please contact Fahd Alotaibi at "fsa081~AT~cs.bham.ac.uk" or "fsalotaibi~AT~kau.edu.sa".