Authors: | Yassine Benajiba (Ph.D. student) and Paolo Rosso. |
URL: | http://www.dsic.upv.es/grupos/nle/ |
Contact: | Yassine Benajiba <benajibayassine |
Description
ANERcorp is an Arabic named entity recognition (NER) corpus of 150,000 tokens, rising to 200,000 tokens after segmentation.
Functionality
An IOB-annotated Arabic NER resource.
Technology
The corpus was annotated by a single person to keep the annotation consistent. Each named entity is tagged with its class using the IOB annotation scheme, following the guidelines of the corpora used in the CoNLL 2002 and 2003 evaluation campaigns.
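As an illustration of how IOB-tagged data of this kind can be consumed, the following is a minimal sketch in Python. It assumes a CoNLL-style layout with one token and its tag per line, separated by whitespace; the actual file layout of ANERcorp is not described here, so treat that as an assumption. The function names (`parse_line`, `iob_to_spans`) and the tag labels in the example (`PER`, `LOC`) are hypothetical and chosen only for the sketch.

```python
from typing import List, Tuple


def parse_line(line: str) -> Tuple[str, str]:
    """Split one 'token tag' line into (token, tag).

    Assumption: the tag is the last whitespace-separated field.
    """
    token, tag = line.rsplit(None, 1)
    return token, tag


def iob_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group IOB tags into (class, start, end) spans, with end exclusive.

    'B-X' opens an entity of class X, 'I-X' continues it, 'O' is outside.
    An 'I-X' that does not continue an open entity of class X is treated
    leniently as the start of a new entity.
    """
    spans: List[Tuple[str, int, int]] = []
    start, label = None, None
    for i, tag in enumerate(tags):
        if tag == "O":
            if label is not None:
                spans.append((label, start, i))
            start, label = None, None
            continue
        prefix, _, cls = tag.partition("-")
        if prefix == "B" or cls != label:
            # Close any open entity before opening a new one.
            if label is not None:
                spans.append((label, start, i))
            start, label = i, cls
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Hypothetical example lines (placeholder tokens, not taken from the corpus).
lines = ["tok1 B-PER", "tok2 I-PER", "tok3 O", "tok4 B-LOC", "tok5 O"]
tokens, tags = zip(*(parse_line(l) for l in lines))
spans = iob_to_spans(list(tags))
```

Each span gives the entity class and the token range it covers, so `tokens[start:end]` recovers the entity's surface form.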
Technical requirements
None.
Modules
-
Innovation
To our knowledge, it is the only freely available Arabic NER corpus.
Development
Developed as part of Yassine Benajiba’s AECI Ph.D. and the MiDES CICYT TIN2006-15265-C06-04 research project, co-funded by the AECI-PCI A01031707 and A706706 projects.
Publications
- Benajiba Y., Rosso P. Arabic Named Entity Recognition using Conditional Random Fields. In: Proc. Workshop on HLT & NLP within the Arabic world. Arabic Language and local languages processing: Status Updates and Prospects, 6th Int. Conf. on Language Resources and Evaluation, LREC-2008, Marrakech, Morocco, May 26-31, 2008.
- Benajiba Y., Diab M., Rosso P. Arabic Named Entity Recognition: An SVM-based approach. In: Proc. Int. Arab Conf. on Information Technology, ACIT-2008, Hammamet, Tunisia, December, 2008.
- Benajiba Y., Rosso P., Benedí J.M. ANERsys: An Arabic Named Entity Recognition system based on Maximum Entropy. In: Proc. 8th Int. Conf. on Comput. Linguistics and Intelligent Text Processing, CICLing-2007, Springer-Verlag, LNCS(4394), pp. 143-153, 2007.