The Arabic Wikipedia XML corpus

Authors:	David Pinto (R30 version), Ludovic Denoyer, and Patrick Gallinari.
URL:	http://www.dsic.upv.es/grupos/nle/downloads.html
Contact:	David Eduardo Pinto Avendaño <dpintocs.buap.mx>

Description

This corpus contains the 30 most frequent categories of the Arabic Wikipedia XML corpus gathered by Ludovic Denoyer and Patrick Gallinari. The categories were selected to provide a testbed for the single-label categorization task in Arabic.
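The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `labels` dictionary (document id mapped to one category) and the function name `select_top_categories` are hypothetical and do not reflect the actual file layout or tooling of the corpus.

```python
from collections import Counter


def select_top_categories(labels, k=30):
    """Keep only documents whose category is among the k most frequent.

    `labels` is a hypothetical mapping of document id -> single category
    label; the real corpus files may be organized differently.
    """
    counts = Counter(labels.values())
    top = {category for category, _ in counts.most_common(k)}
    return {doc: cat for doc, cat in labels.items() if cat in top}


# Toy data, invented for illustration only.
toy = {
    "d1": "history", "d2": "history", "d3": "sports",
    "d4": "science", "d5": "sports", "d6": "history",
}
subset = select_top_categories(toy, k=2)
```

With `k=2` the toy example keeps the documents labeled with the two most frequent categories and drops the rest; the corpus itself uses the 30 most frequent categories.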

Functionality

The aim of this corpus is to support experiments with supervised and unsupervised classifiers on texts written in Arabic. The gold standard is provided, as well as tokenized and untokenized versions of the corpus.

Technology

The corpus is distributed as raw text, in tokenized and untokenized versions, together with the gold standard.

Technical requirements

Using the corpus has no special requirements.

Modules

Innovation

This corpus is an attempt to provide easy access to pre-processed texts for use in the Arabic categorization task.

Development

Developed as part of David Pinto's Ph.D. and the MiDES CICYT TIN2006-15265-C06-04 research project.

Publications

David Pinto: On Clustering of Narrow Domain Short-Text Corpora. PhD Thesis, Universidad Politécnica de Valencia, Spain, July 2008.
David Pinto, Paolo Rosso, Yassine Benajiba, Anas Ahachad, Héctor Jiménez-Salazar: Word Sense Induction in the Arabic Language: A Self-Term Expansion Based Approach. The Egyptian Society of Language Engineering (ESOLE): 235-245, 2007.

Red Temática en Tratamiento de la Información Multilingüe y Multimodal (TIMM)
