Co-derivatives corpus

Authors:	Alberto Barrón-Cedeño
URL:	http://users.dsic.upv.es/grupos/nle/resources/abc/download-coderiv.html
Contact:	Paolo Rosso <prossodsic.upv.es>

Description

This corpus was built for the analysis of co-derivatives, text reuse and (simulated) plagiarism. It comprises more than 20,000 Wikipedia documents in German, English, Hindi and Spanish (around 5,000 documents per language). For each language, some of the most frequently consulted Wikipedia articles were taken as pivots, and ten revisions of each pivot were downloaded; these revisions make up the pivot's set of co-derivatives. The corpus comes in three versions: (i) original (articles without further manipulation); (ii) clean (articles after case folding and removal of punctuation marks); and (iii) stopword-free (articles after case folding and removal of punctuation marks and stopwords).
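As a rough illustration of how the clean and stopword-free versions relate to the original text, the following minimal Python sketch applies the same kinds of steps. It is not the authors' preprocessing pipeline: the function names `clean` and `remove_stopwords` and the tiny `STOPWORDS` set are hypothetical, and the definition of punctuation used here (Unicode categories starting with "P") is an assumption.

```python
import unicodedata

# Hypothetical stopword list for illustration only; the list actually
# used to build the corpus is not specified in this description.
STOPWORDS = {"the", "of", "a", "and", "in", "is"}


def clean(text: str) -> str:
    """Case-fold the text and remove punctuation marks."""
    folded = text.casefold()
    # Assumption: "punctuation" means any character whose Unicode general
    # category starts with "P". This also covers marks such as the
    # Devanagari danda, while leaving combining vowel signs untouched.
    no_punct = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in folded
    )
    # Collapse the whitespace left behind by removed punctuation.
    return " ".join(no_punct.split())


def remove_stopwords(text: str, stopwords=STOPWORDS) -> str:
    """Apply clean() and then drop every token found in `stopwords`."""
    return " ".join(tok for tok in clean(text).split() if tok not in stopwords)


if __name__ == "__main__":
    sample = "The Quick, brown fox!"
    print(clean(sample))
    print(remove_stopwords(sample))
```

Note that `str.casefold` is more aggressive than `str.lower` (for example, it maps the German "ß" to "ss"); whether the corpus was built with one or the other is not stated here.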

Functionality

The corpus supports experiments on co-derivatives and text similarity analysis in the following languages: German, English, Hindi and Spanish.
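A typical experiment of this kind takes one document as the query and ranks the remaining documents by similarity, expecting the query's co-derivatives near the top. The sketch below uses the Jaccard coefficient over word bigrams as one common similarity measure; it is an assumed example, not necessarily a measure evaluated by the authors, and the names `word_ngrams`, `jaccard` and `rank_by_similarity` are hypothetical.

```python
def word_ngrams(text: str, n: int = 2) -> set:
    """Return the set of word n-grams of a whitespace-tokenised text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |a & b| / |a | b|; 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_by_similarity(query: str, candidates: dict, n: int = 2) -> list:
    """Rank candidate documents (name -> text) by similarity to the query."""
    query_grams = word_ngrams(query, n)
    scored = [
        (name, jaccard(query_grams, word_ngrams(text, n)))
        for name, text in candidates.items()
    ]
    # Highest similarity first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy texts standing in for already cleaned corpus documents.
    docs = {"revision": "a b c e", "unrelated": "x y z w"}
    for name, score in rank_by_similarity("a b c d", docs):
        print(name, score)
```

Working on sets of n-grams ignores term frequency and document length, which keeps the sketch short; frequency-based models would need a different representation.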

Technology

Technical requirements

Modules

Innovation

A publicly available corpus for co-derivatives and text similarity analysis in German, English, Hindi and Spanish.

Development

MICINN research project TEXT-ENTERPRISE 2.0 TIN2009-13391-C04-03 (Plan I+D+i).
Developed as part of the Ph.D. Thesis of Alberto Barrón-Cedeño (writing-up phase).

Publications

Barrón-Cedeño A., Eiselt A., Rosso P. Monolingual Text Similarity Measures: A Comparison of Models over Wikipedia Articles Revisions. In: Proc. 7th Int. Conf. on Natural Language Processing, ICON-2009, Hyderabad, India, December 15-17, pp. 29-38, 2009.

Red Temática en Tratamiento de la Información Multilingüe y Multimodal (TIMM)
