Corpus R8B

Authors:	Diego Ingaramo, Marcelo Errecalde (Universidad Nacional de San Luis, Argentina), Paolo Rosso
URL:	https://sites.google.com/site/merrecalde/resources
Contact:	Marcelo Errecalde <merreca@unsl.edu.ar>, Paolo Rosso <prosso@dsic.upv.es>

Description

R8B is a subset of the documents in the R8-Test corpus, itself a sub-collection of the well-known Reuters-21578 dataset. R8B has the same number of groups as R8-Test (eight), but the two collections differ in the number of documents in two specific groups. In R8-Test, two of the eight groups contain almost 70% of all the documents in the collection. R8B is intended to be as similar to R8-Test as possible while correcting the imbalance caused by these two “big” groups, so it can be regarded as a “balanced” version of R8-Test with respect to the number of documents per group. To obtain it, a specific number of documents were removed from each of the two groups, yielding a collection without the group-size differences that R8-Test exhibits. Features of R8B: number of groups = 8; number of documents = 816; number of terms = 71842; vocabulary size = 5854; average number of terms per document = 88.04.
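
The features listed above can be recomputed for any labeled collection. The following is a minimal sketch, not the authors' procedure: it assumes documents are available as (label, text) pairs and that terms are obtained by lowercasing and splitting on whitespace. The function name `corpus_stats` and the toy documents are hypothetical, and the tokenization actually used for the published figures is not stated in the source.

```python
def corpus_stats(docs):
    """Compute basic corpus features from a list of (label, text) pairs.

    Assumption: a term is any whitespace-separated, lowercased token.
    """
    tokenized = [text.lower().split() for _, text in docs]
    n_docs = len(tokenized)
    n_terms = sum(len(tokens) for tokens in tokenized)
    vocabulary = {token for tokens in tokenized for token in tokens}
    return {
        "groups": len({label for label, _ in docs}),
        "documents": n_docs,
        "terms": n_terms,
        "vocabulary": len(vocabulary),
        "avg_terms_per_doc": n_terms / n_docs if n_docs else 0.0,
    }


# Toy data for illustration only; these are not documents from R8B.
toy_docs = [
    ("earn", "profit rises in quarter"),
    ("acq", "company buys rival company"),
    ("earn", "profit falls"),
]
stats = corpus_stats(toy_docs)
print(stats)
```

As a consistency check on the reported figures, 71842 terms over 816 documents gives 71842 / 816 ≈ 88.04 terms per document, which matches the stated average.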

Functionality

This corpus is intended for supervised or unsupervised categorization tasks that mainly involve short texts. The aim was to provide a more balanced variant of R8-Test, without the size differences that two of its groups present.

Technology

Building this corpus required no special development tools beyond very simple routines that reduce the size of the two biggest groups in R8-Test.
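
A size-reduction routine of the kind mentioned above might look like the sketch below. The source does not say how many documents were removed or how they were selected, so everything specific here is an assumption: documents are dropped by seeded random sampling, and by default each of the largest groups is cut down to the size of the next-largest remaining group. The names `balance_largest_groups`, `n_largest`, `target_size`, and `seed` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def balance_largest_groups(docs, n_largest=2, target_size=None, seed=0):
    """Downsample the n_largest groups of a list of (label, text) pairs.

    Assumptions (not from the source): documents are removed at random,
    and the default target is the size of the biggest group left untouched.
    """
    by_group = defaultdict(list)
    for label, text in docs:
        by_group[label].append((label, text))

    # Rank group labels by size, largest first.
    ranked = sorted(by_group, key=lambda g: len(by_group[g]), reverse=True)
    largest, rest = ranked[:n_largest], ranked[n_largest:]

    if target_size is None:
        if not rest:
            return list(docs)  # nothing to compare against; leave unchanged
        target_size = len(by_group[rest[0]])

    rng = random.Random(seed)  # seeded so the reduction is reproducible
    balanced = []
    for group, members in by_group.items():
        if group in largest and len(members) > target_size:
            members = rng.sample(members, target_size)
        balanced.extend(members)
    return balanced


# Toy collection for illustration only; sizes are not those of R8-Test.
toy_docs = (
    [("earn", f"e{i}") for i in range(10)]
    + [("acq", f"a{i}") for i in range(6)]
    + [("crude", f"c{i}") for i in range(3)]
    + [("trade", f"t{i}") for i in range(2)]
)
balanced = balance_largest_groups(toy_docs)
print(Counter(label for label, _ in balanced))
```

Passing an explicit `target_size` gives direct control over how many documents each large group keeps, which is closer to the "specific number of documents" wording of the description.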

Technical requirements

No special hardware or software is required. Disk space required: 415 KB.

Modules

Innovation

Unlike R8-Test, which is an unbalanced document collection, this corpus makes it possible to work with a short-text collection similar to R8-Test but with groups of comparable size.

Development

MICINN research project TEXT-ENTERPRISE 2.0 TIN2009-13391-C04-03 (Plan I+D+i).
This corpus was generated as part of the Ph.D. work of Diego Ingaramo under the supervision of Marcelo Errecalde (external researcher of TEXT-ENTERPRISE 2.0) and Paolo Rosso.

Publications

Ingaramo D., Cagnina L., Errecalde M., Rosso P. A Particle Swarm Optimizer to cluster short-text corpora: a performance study. In: Proc. Workshop on Natural Language Processing and Web-based Technologies, 12th edition of the Ibero-American Conference on Artificial Intelligence, IBERAMIA-2010, Bahía Blanca, Argentina, November 1-5, 2010, pp. 71-79.

Red Temática en Tratamiento de la Información Multilingüe y Multimodal (TIMM)
