Corpus R8+

Authors:	Diego Ingaramo, Marcelo Errecalde (Universidad Nacional de San Luis, Argentina), Paolo Rosso
URL:	https://sites.google.com/site/merrecalde/resources
Contact:	Marcelo Errecalde <merreca@unsl.edu.ar>, Paolo Rosso <prosso@dsic.upv.es>

Description

R8+ is a subset of the R8-Test corpus, itself a sub-collection of the well-known Reuters-21578 dataset. R8+ has the same eight groups as R8-Test but fewer documents per group: each R8+ group contains only the longest documents of the corresponding R8-Test group (20% of the documents in each original group). Features of R8+: number of groups = 8, number of documents = 445, number of terms = 66314, vocabulary size = 7797, average number of terms per document = 149.02.
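As a rough illustration of how features of this kind can be computed, the sketch below counts groups, documents, terms, and vocabulary size for a small labelled collection. The function name `corpus_features`, the `(group, text)` input format, and the lowercased whitespace tokenization are assumptions made for the example; the tokenization actually used to obtain the figures above is not stated here.

```python
def corpus_features(docs):
    """Compute simple corpus statistics.

    docs: list of (group, text) pairs (hypothetical input format).
    Tokens are lowercased whitespace-separated strings, which is an
    assumption and not necessarily how the R8+ figures were obtained.
    """
    token_lists = [text.lower().split() for _, text in docs]
    n_terms = sum(len(tokens) for tokens in token_lists)
    vocabulary = {token for tokens in token_lists for token in tokens}
    return {
        "groups": len({group for group, _ in docs}),
        "documents": len(docs),
        "terms": n_terms,
        "vocabulary": len(vocabulary),
        "avg_terms_per_doc": n_terms / len(docs),
    }


# The reported average follows from the reported totals: 66314 / 445.
reported_average = round(66314 / 445, 2)
```

With the totals listed above, `reported_average` evaluates to 149.02, which matches the stated average number of terms per document.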

Functionality

This corpus is intended for supervised or unsupervised categorization tasks that mainly involve short texts. R8+ contains only the longest documents in R8-Test, so that the difficulty of such a collection can be compared with that of collections of “very short” documents such as R8-, a similar collection built from the shortest documents of R8-Test. Both collections were generated together in previous studies to provide the “shortest length” and “largest length” versions of R8-Test.

Technology

Building this corpus required no special development tools beyond very simple routines that select the longest 20% of the documents in each R8-Test group.
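A selection routine of this kind might look like the following minimal sketch. It is not the authors' code: the function name `select_by_length`, the `(group, text)` input format, measuring length in whitespace-separated tokens, and rounding the per-group quota to the nearest integer (with at least one document kept) are all assumptions.

```python
from collections import defaultdict


def select_by_length(docs, fraction=0.2, longest=True):
    """Keep a fraction of the documents of each group, ranked by length.

    docs: iterable of (group, text) pairs (hypothetical input format).
    Length is the number of whitespace-separated tokens (an assumption).
    longest=True keeps the longest documents; longest=False keeps the
    shortest ones instead.
    """
    by_group = defaultdict(list)
    for group, text in docs:
        by_group[group].append(text)

    selected = {}
    for group, texts in by_group.items():
        # Per-group quota; the rounding rule is an assumption.
        quota = max(1, round(len(texts) * fraction))
        ranked = sorted(texts, key=lambda t: len(t.split()), reverse=longest)
        selected[group] = ranked[:quota]
    return selected
```

Calling `select_by_length(docs)` keeps the longest 20% of each group, while `select_by_length(docs, longest=False)` keeps the shortest 20%, in the spirit of the companion R8- collection.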

Technical requirements

No special hardware or software is required. Disk space required: 440 KB.

Modules

Innovation

Unlike R8-Test as a whole, which contains relatively short documents, this corpus makes it possible to focus on the particularities of working with the longest documents of that collection. Such studies are usually complemented with results obtained on R8-, which was generated from the shortest documents of R8-Test.

Development

MICINN research project TEXT-ENTERPRISE 2.0 TIN2009-13391-C04-03 (Plan I+D+i).
This corpus was generated as part of the Ph.D. work of Diego Ingaramo under the supervision of Marcelo Errecalde (external researcher of TEXT-ENTERPRISE 2.0) and Paolo Rosso.

Publications

Ingaramo D., Errecalde M., Rosso P. A general bio-inspired method to improve the short-text clustering task. In: Proc. 10th Int. Conf. on Comput. Linguistics and Intelligent Text Processing, CICLing-2010, Springer-Verlag, LNCS(6008), pp. 661-672, 2010.
Errecalde M., Ingaramo D., Rosso P. ITSA*: An Effective Iterative Method for Short-Text Clustering Tasks. In: Proc. 23rd Int. Conf. on Industrial, Engineering & Other Applications of Applied Intelligent Systems, IEA-AIE-2010, Springer-Verlag, LNAI(6096), pp. 550-559, 2010.
Rosas M., Errecalde M., Rosso P. Un Análisis Comparativo de Estrategias para la Categorización Semántica de Textos Cortos. In: Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN), num. 44, pp. 11-18, 2010.
Ingaramo D., Rosas M.V., Errecalde M., Rosso P. Clustering Iterativo de Textos Cortos con Representaciones basadas en Conceptos. In: Proc. Workshop on Natural Language Processing and Web-based Technologies, 12th edition of the Ibero-American Conference on Artificial Intelligence, IBERAMIA-2010, Bahía Blanca, Argentina, November 1-5, pp. 80-89, 2010.

Red Temática en Tratamiento de la Información Multilingüe y Multimodal (TIMM)