Fine-grained Arabic Named Entity Corpora

About

The gold-standard and automatically-developed fine-grained Arabic named entity corpora are resources created by annotating named entities with 50 fine-grained classes.

The annotation uses a two-level taxonomy in which each entity is annotated with both a coarse-grained class and a fine-grained class.
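As a rough illustration of the two-level taxonomy, an entity label can be modelled as a pair of a coarse-grained class and a fine-grained class. The sketch below is only an assumption about how one might represent such labels in code: the tag string "PER_Politician", the "_" separator, and the EntityLabel class are hypothetical and are not taken from the corpus files, whose format is not described on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityLabel:
    """A two-level label: one coarse-grained class plus one fine-grained class."""
    coarse: str
    fine: str

    @classmethod
    def parse(cls, tag: str, sep: str = "_") -> "EntityLabel":
        # Hypothetical tag layout "<coarse><sep><fine>"; adjust to the real files.
        coarse, _, fine = tag.partition(sep)
        if not coarse or not fine:
            raise ValueError(f"expected '<coarse>{sep}<fine>', got {tag!r}")
        return cls(coarse=coarse, fine=fine)


# Hypothetical example tag, used only to show the two levels.
label = EntityLabel.parse("PER_Politician")
print(label.coarse, label.fine)
```

Keeping both levels in one object makes it possible to evaluate a tagger at either granularity without re-reading the data.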

Manually annotated gold-standard corpora:

1) WikiFANE_Gold: Gold standard Wikipedia-based Fine-grained Arabic Named Entity Corpus, ~500K tokens.

2) NewsFANE_Gold: Gold standard Newswire-based Fine-grained Arabic Named Entity Corpus, ~170K tokens.

These corpora were manually annotated from Arabic Wikipedia and newswire sources, respectively.

DOWNLOAD
Mirror download via sourceforge.net

To use the gold-standard corpora, please cite the COLING 2014 paper (Alotaibi and Lee, 2014).

Automatically-developed corpora:

1) WikiFANE_Whole: All sentences of the Arabic Wikipedia articles were retrieved to compile the corpus. ~2M tokens.

2) WikiFANE_Selective: Only sentences that contain at least one NE phrase were retrieved to compile the corpus. ~2M tokens.
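The difference between the two automatically-developed corpora comes down to a sentence filter: the "whole" variant keeps every sentence, while the "selective" variant keeps only sentences with at least one NE phrase. A minimal sketch of such a filter follows. It assumes each sentence is a list of (token, tag) pairs and that non-entity tokens carry the tag "O"; both are assumptions made for illustration, not a description of the distributed files or of the authors' actual pipeline.

```python
from typing import List, Tuple

# Assumed in-memory representation: a sentence is a list of (token, tag) pairs.
Sentence = List[Tuple[str, str]]


def has_named_entity(sentence: Sentence, outside_tag: str = "O") -> bool:
    """Return True if at least one token is tagged as part of a named entity."""
    return any(tag != outside_tag for _, tag in sentence)


def select_sentences(sentences: List[Sentence], outside_tag: str = "O") -> List[Sentence]:
    """Keep only the sentences that contain at least one named entity."""
    return [s for s in sentences if has_named_entity(s, outside_tag)]


# Toy data with hypothetical tags: the first sentence has an entity, the second does not.
whole = [
    [("القاهرة", "LOC"), ("مدينة", "O")],
    [("هذا", "O"), ("مثال", "O")],
]
selective = select_sentences(whole)
print(len(whole), len(selective))
```

In this toy example the "whole" list keeps both sentences and the "selective" list keeps only the first.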

DOWNLOAD
Mirror download via sourceforge.net

To use the automatically-developed corpora, please cite the IJCNLP 2013 paper (Alotaibi and Lee, 2013).

References

F. Alotaibi and M. Lee, "Automatically Developing a Fine-grained Arabic Named Entity Corpus and Gazetteer by utilizing Wikipedia", in Proceedings of IJCNLP, pp. 392-400, Nagoya, Japan, October 2013. (acceptance rate: 23.4%)

F. Alotaibi and M. Lee, "A Hybrid Approach to Features Representation for Fine-grained Arabic Named Entity Recognition", in Proceedings of COLING 2014, Dublin, Ireland, August 23-29, 2014.

Contact the author

http://www.cs.bham.ac.uk/~fsa081

http://fsalotaibi.kau.edu.sa

For correspondence, please contact Fahd Alotaibi at "fsa081~AT~cs.bham.ac.uk" or "fsalotaibi~AT~kau.edu.sa".


Last updated
6/13/2014 1:26:48 AM