Authors: | David Pinto (R30 version), Ludovic Denoyer, and Patrick Gallinari. |
URL: | http://www.dsic.upv.es/grupos/nle/downloads.html |
Contact: | David Eduardo Pinto Avendaño <dpintocs.buap.mx> |
Description
The 30 most frequent categories of the Arabic Wikipedia XML corpus gathered by Ludovic Denoyer and Patrick Gallinari were selected in order to provide a testbed for the single-label categorization task in the Arabic language.
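The selection step described above can be sketched as follows. This is a minimal illustration, not the authors' actual procedure: the function name `select_top_categories`, the document-to-category dictionary, and the toy data are all hypothetical, and the real corpus file layout is not assumed here.

```python
from collections import Counter


def select_top_categories(doc_labels, k=30):
    """Keep only the documents whose category is among the k most frequent.

    doc_labels is a hypothetical mapping {document_id: category}.
    """
    counts = Counter(doc_labels.values())
    top = {category for category, _ in counts.most_common(k)}
    return {doc: cat for doc, cat in doc_labels.items() if cat in top}


# Toy data for illustration only; not taken from the corpus.
toy = {
    "d1": "sports",
    "d2": "sports",
    "d3": "history",
    "d4": "history",
    "d5": "music",
}
subset = select_top_categories(toy, k=2)
```

Because every document keeps exactly one category, the filtered mapping remains suitable for single-label categorization.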
Functionality
The aim of this corpus is to support experiments with supervised and unsupervised classifiers on texts written in Arabic. The gold standard is provided, as well as the tokenized and untokenized versions of the corpus.
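One way the gold standard might be used in such experiments is sketched below: accuracy for a supervised classifier and purity for an unsupervised clustering. The function names, the dictionary-based input format, and the toy data are assumptions made for illustration; the source does not state which evaluation measures or file formats were used.

```python
from collections import Counter, defaultdict


def accuracy(gold, predicted):
    """Fraction of shared documents whose predicted label matches the gold label."""
    shared = gold.keys() & predicted.keys()
    if not shared:
        raise ValueError("no documents in common")
    return sum(gold[doc] == predicted[doc] for doc in shared) / len(shared)


def purity(gold, clusters):
    """Clustering purity: share of documents carrying their cluster's majority gold label."""
    labels_by_cluster = defaultdict(list)
    for doc, cluster_id in clusters.items():
        if doc in gold:
            labels_by_cluster[cluster_id].append(gold[doc])
    total = sum(len(labels) for labels in labels_by_cluster.values())
    if total == 0:
        raise ValueError("no documents in common")
    majority = sum(
        Counter(labels).most_common(1)[0][1]
        for labels in labels_by_cluster.values()
    )
    return majority / total


# Toy data for illustration only; not taken from the corpus.
gold = {"d1": "a", "d2": "a", "d3": "b", "d4": "b"}
predicted = {"d1": "a", "d2": "b", "d3": "b", "d4": "b"}
clusters = {"d1": 0, "d2": 0, "d3": 1, "d4": 0}

acc = accuracy(gold, predicted)
pur = purity(gold, clusters)
```

Purity needs no mapping between cluster identifiers and category names, which makes it convenient for unsupervised runs; other measures could be substituted.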
Technology
The corpus is distributed as raw text, in both tokenized and untokenized versions, together with the gold standard.
Technical requirements
The corpus has no special requirements.
Modules
Innovation
This is an attempt to provide easy access to pre-processed texts for use in the Arabic categorization task.
Development
Developed as part of David Pinto's Ph.D. work and the MiDES CICYT TIN2006-15265-C06-04 research project.
Publications
- David Pinto: On Clustering of Narrow Domain Short-Text Corpora. PhD Thesis, Universidad Politécnica de Valencia, Spain, July 2008.
- David Pinto, Paolo Rosso, Yassine Benajiba, Anas Ahachad, Héctor Jiménez-Salazar: Word Sense Induction in the Arabic Language: A Self-Term Expansion Based Approach. The Egyptian Society of Language Engineering (ESOLE): 235-245, 2007.