Yassine Benajiba (Ph.D. student) and Paolo Rosso
Yassine Benajiba <benajibayassinegmail.com>
A Named Entity Recognition (NER) model trained with an SVM-based approach on a training file of 125,000 Arabic tokens.
The model lets the user extract named entities from open-domain text and classify them into four categories: person, location, organization and miscellaneous. To improve performance, the model was trained on ATB-segmented data, which reduces the sparseness of Arabic data.
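As an illustration of how per-token labels map onto the four categories, the following minimal sketch groups tagged tokens into entity spans. It assumes IOB2-style tags (`B-PER`, `I-PER`, `O`, ...) with the labels `PER`, `LOC`, `ORG` and `MISC`; the description above does not state the model's actual tag set, and both the function name `group_entities` and the romanized sample tokens are purely illustrative.

```python
from typing import List, Optional, Tuple


def group_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (category, text) spans from IOB2-style tags (an assumed scheme)."""
    entities: List[Tuple[str, str]] = []
    label: Optional[str] = None   # category of the span currently open
    span: List[str] = []          # tokens of the span currently open

    for token, tag in zip(tokens, tags):
        prefix, _, category = tag.partition("-")
        if prefix == "I" and category == label:
            # Continuation of the open span.
            span.append(token)
            continue
        # Any other tag closes the open span, if there is one.
        if label is not None:
            entities.append((label, " ".join(span)))
        if prefix in ("B", "I") and category:
            # Start a new span (a stray I- tag is treated leniently as a start).
            label, span = category, [token]
        else:
            label, span = None, []

    if label is not None:
        entities.append((label, " ".join(span)))
    return entities


if __name__ == "__main__":
    # Illustrative romanized tokens and hypothetical tags.
    tokens = ["zAr", "mHmd", "Ely", "AlqAhrp"]
    tags = ["O", "B-PER", "I-PER", "B-LOC"]
    print(group_entities(tokens, tags))
```

Grouping at this level keeps multi-token names together, so a downstream user works with whole entities rather than individual tags.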
The model is trained with Support Vector Machines (SVMs) using the Yamcha toolkit (http://chasen.org/~taku/software/yamcha/).
The input file must be ATB-segmented and transliterated into Romanized characters. Yamcha must also be installed on the machine.
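A minimal sketch of preparing such an input file is shown below. It follows Yamcha's general data layout (one token per line, an empty line after each sentence) and assumes the text has already been ATB-segmented and romanized by other tools. The single token column is an assumption: the feature columns this particular model expects are not stated here, and the helper name `to_yamcha_input` is hypothetical.

```python
from typing import Iterable, List


def to_yamcha_input(sentences: Iterable[List[str]]) -> str:
    """Render pre-segmented, romanized sentences as one token per line.

    An empty line is written after every sentence to mark its boundary.
    Assumption: the model needs only a token column.
    """
    lines: List[str] = []
    for sentence in sentences:
        tokens = [token for token in sentence if token.strip()]
        if not tokens:
            continue  # skip empty sentences instead of emitting stray blank lines
        lines.extend(tokens)
        lines.append("")  # sentence boundary
    return ("\n".join(lines) + "\n") if lines else ""


if __name__ == "__main__":
    # Illustrative romanized tokens; real input comes from the user's own pipeline.
    sample = [["zAr", "mHmd", "Ely", "AlqAhrp"], ["fy", "AlqAhrp"]]
    print(to_yamcha_input(sample), end="")
```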
A single module that performs basic decoding of the data provided by the user.
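The decoding step could be driven from a script along the following lines. This is a sketch under stated assumptions rather than the distributed module itself: it assumes the `yamcha` executable is on the `PATH`, that the model is passed with Yamcha's `-m` option, and that the predicted tag is appended as the last column of each output line. The model file name `ner.model` and all function names are hypothetical.

```python
import subprocess
from typing import List, Tuple


def yamcha_command(model_path: str) -> List[str]:
    """Build the command line for tagging with a trained Yamcha model."""
    return ["yamcha", "-m", model_path]


def parse_yamcha_output(text: str) -> List[List[Tuple[str, str]]]:
    """Read (token, predicted_tag) pairs, taking the tag from the last column."""
    sentences: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for line in text.splitlines():
        columns = line.split()
        if not columns or columns == ["EOS"]:
            # An empty line (or an EOS marker) ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


def decode(model_path: str, yamcha_input: str) -> List[List[Tuple[str, str]]]:
    """Pipe the prepared input through Yamcha and parse its standard output."""
    result = subprocess.run(
        yamcha_command(model_path),
        input=yamcha_input,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Yamcha exits with an error
    )
    return parse_yamcha_output(result.stdout)


if __name__ == "__main__":
    # Hypothetical model file name; requires Yamcha to be installed.
    for sentence in decode("ner.model", "zAr\nmHmd\nEly\nAlqAhrp\n\n"):
        print(sentence)
```

Keeping the parser separate from the subprocess call means the output handling can be checked without Yamcha installed.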
To our knowledge, no Arabic NER systems are freely available to the research community. The model has been tested, and the results were presented at the EMNLP-2008 and ACIT-2008 conferences.
Developed as part of Yassine Benajiba’s AECI Ph.D. and the MiDES CICYT TIN2006-15265-C06-04 research project, co-funded by the AECI-PCI A01031707 project.
- Benajiba Y., Diab M., Rosso P. Arabic Named Entity Recognition using Optimized Feature Sets. In: Proc. Int. Conf. on Empirical Methods in Natural Language Processing, EMNLP-2008, Waikiki, Honolulu, U.S.A., October, 2008.
- Benajiba Y., Diab M., Rosso P. Arabic Named Entity Recognition: An SVM-based approach. In: Proc. Int. Arab Conf. on Information Technology, ACIT-2008, Hammamet, Tunisia, December, 2008.