Authors: | Alberto Barrón-Cedeño |
URL: | http://users.dsic.upv.es/grupos/nle/resources/abc/download-coderiv.html |
Contact: | Paolo Rosso <prossodsic.upv.es> |
Description
This corpus was created for the analysis of co-derivatives, text reuse and (simulated) plagiarism. It contains more than 20,000 Wikipedia documents in German, English, Hindi and Spanish (around 5,000 documents per language). For each language, some of the most frequently consulted Wikipedia articles were taken as pivots, and ten revisions of each pivot were downloaded; these revisions make up the set of co-derivatives. The corpus comes in three versions: (i) original (articles without further manipulation); (ii) clean (articles after case folding and removal of punctuation marks); and (iii) stopword-free (articles after case folding and removal of punctuation marks and stopwords).
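As a rough illustration of how the clean and stopword-free versions relate to the original text, the sketch below applies case folding, punctuation removal and stopword removal to a single string. The function names and the tiny stopword set are hypothetical; the actual tools and stopword lists used to build the corpus are not specified here.

```python
import unicodedata


def clean(text: str) -> str:
    """Case-fold the text and replace punctuation marks with spaces."""
    folded = text.casefold()
    # Unicode categories starting with "P" cover punctuation in any script,
    # so this also handles marks outside ASCII.
    no_punct = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in folded
    )
    # Collapse runs of whitespace left behind by the removed punctuation.
    return " ".join(no_punct.split())


def remove_stopwords(cleaned: str, stopwords: set) -> str:
    """Drop every whitespace-separated token found in the stopword set."""
    return " ".join(tok for tok in cleaned.split() if tok not in stopwords)


# Hypothetical stopword set, for illustration only.
EXAMPLE_STOPWORDS = {"the", "of", "a", "in", "and"}

original = "The Corpus, of Wikipedia: revisions!"
cleaned = clean(original)
stopword_free = remove_stopwords(cleaned, EXAMPLE_STOPWORDS)
```

Here `cleaned` plays the role of version (ii) and `stopword_free` the role of version (iii) for one short string.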
Functionality
The corpus supports experiments on co-derivatives and text similarity analysis in German, English, Hindi and Spanish.
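A minimal sketch of the kind of experiment this enables: scoring two revisions by the Jaccard coefficient over their word n-grams. The measure, the function names and the default `n=3` are assumptions chosen for illustration, not necessarily the similarity models evaluated by the authors.

```python
def word_ngrams(text: str, n: int = 3) -> set:
    """Return the set of word n-grams of a whitespace-tokenised text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_similarity(doc_a: str, doc_b: str, n: int = 3) -> float:
    """Jaccard coefficient between the word n-gram sets of two texts."""
    grams_a = word_ngrams(doc_a, n)
    grams_b = word_ngrams(doc_b, n)
    if not grams_a and not grams_b:
        # Both texts are shorter than n words; treat them as having no overlap.
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Two made-up "revisions" of the same sentence.
revision_1 = "wikipedia is a free online encyclopedia"
revision_2 = "wikipedia is a free multilingual encyclopedia"
score = jaccard_similarity(revision_1, revision_2, n=2)
```

Revisions of the same pivot article would be expected to score higher than unrelated articles under a measure like this, which is what makes the co-derivative sets usable as ground truth.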
Technology
Technical requirements
Modules
Innovation
A publicly available corpus for co-derivatives and text similarity analysis in German, English, Hindi and Spanish.
Development
- MICINN research project TEXT-ENTERPRISE 2.0 TIN2009-13391-C04-03 (Plan I+D+i).
- Developed as part of the Ph.D. Thesis of Alberto Barrón-Cedeño (writing-up phase).
Publications
Barrón-Cedeño A., Eiselt A., Rosso P. Monolingual Text Similarity Measures: A Comparison of Models over Wikipedia Articles Revisions. In: Proc. 7th Int. Conf. on Natural Language Processing, ICON-2009, Hyderabad, India, December 15-17, pp. 29-38, 2009.