Corpus R8-

Authors:	Diego Ingaramo, Marcelo Errecalde (Universidad Nacional de San Luis, Argentina), Paolo Rosso
URL:	https://sites.google.com/site/merrecalde/resources
Contact:	Marcelo Errecalde <merreca@unsl.edu.ar>, Paolo Rosso <prosso@dsic.upv.es>

Description

R8- is a subset of the R8-Test corpus, which is itself a sub-collection of the well-known Reuters-21578 dataset. R8- has the same number of groups as R8-Test (eight), but fewer documents per group: each group of R8- contains only the shortest documents of the corresponding R8-Test group (20% of the documents of each original group). Features of R8-: number of groups = 8, number of documents = 445, number of terms = 8481, vocabulary size = 1876, average number of terms per document = 19.06.
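The figures above are related in a simple way: the average is the total number of terms divided by the number of documents (8481 / 445 ≈ 19.06). As a minimal sketch of how such statistics could be recomputed, assuming the corpus is available as (group label, text) pairs and that terms are lowercased, whitespace-separated tokens (both are assumptions; the original preprocessing is not described here, so the resulting counts may differ), the function name `corpus_stats` is hypothetical:

```python
def corpus_stats(docs):
    """Compute basic statistics for an iterable of (group_label, text) pairs.

    Assumption: a "term" is a lowercased, whitespace-separated token.
    """
    groups = set()
    vocabulary = set()
    n_docs = 0
    n_terms = 0
    for label, text in docs:
        tokens = text.lower().split()
        groups.add(label)
        vocabulary.update(tokens)
        n_docs += 1
        n_terms += len(tokens)
    return {
        "groups": len(groups),
        "documents": n_docs,
        "terms": n_terms,
        "vocabulary": len(vocabulary),
        # Average number of terms per document (0.0 for an empty corpus).
        "avg_terms_per_doc": n_terms / n_docs if n_docs else 0.0,
    }
```

Calling `corpus_stats` on the loaded collection returns a dictionary whose keys mirror the features listed above.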

Functionality

This corpus is intended for supervised or unsupervised categorization tasks that mainly involve short texts. In particular, it has been used in studies of the difficulties that collections of short documents (like R8-) pose for clustering algorithms, compared with collections of arbitrary-length documents.

Technology

Building this corpus required no special development tools beyond very simple routines that select the shortest 20% of the documents in each R8-Test group.
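A selection routine of this kind might look like the sketch below. It is not the authors' code: the function name `shortest_fraction_per_group`, the use of whitespace-separated token counts as the length measure, and the rounding rule (round down, but keep at least one document per group) are all assumptions made for illustration.

```python
from collections import defaultdict


def shortest_fraction_per_group(docs, fraction=0.2):
    """Keep the shortest `fraction` of documents within each group.

    `docs` is an iterable of (group_label, text) pairs.
    Assumption: document length is the number of whitespace-separated tokens.
    """
    by_group = defaultdict(list)
    for label, text in docs:
        by_group[label].append(text)

    selected = []
    for label, texts in by_group.items():
        # Rank the group's documents by token count, shortest first.
        ranked = sorted(texts, key=lambda t: len(t.split()))
        # Assumed rounding rule: round down, but keep at least one document.
        keep = max(1, int(len(ranked) * fraction))
        selected.extend((label, t) for t in ranked[:keep])
    return selected
```

Applied to the R8-Test documents with `fraction=0.2`, a routine like this yields a per-group subset of the shortest texts; the exact document counts depend on the rounding rule chosen.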

Technical requirements

No special hardware or software is required. Disk space required: 44.3 KB.

Modules

Innovation

Unlike R8-Test, which contains relatively short documents, this corpus makes it possible to focus on the particular difficulties of working with extremely short documents.

Development

Developed within the MICINN research project TEXT-ENTERPRISE 2.0 TIN2009-13391-C04-03 (Plan I+D+i).
This corpus was generated as part of the Ph.D. work of Diego Ingaramo, under the supervision of Marcelo Errecalde (external researcher of TEXT-ENTERPRISE 2.0) and Paolo Rosso.

Publications

Ingaramo D., Errecalde M., Rosso P. A general bio-inspired method to improve the short-text clustering task. In: Proc. 10th Int. Conf. on Comput. Linguistics and Intelligent Text Processing, CICLing-2010, Springer-Verlag, LNCS(6008), pp. 661-672, 2010.
Errecalde M., Ingaramo D., Rosso P. ITSA*: An Effective Iterative Method for Short-Text Clustering Tasks. In: Proc. 23rd Int. Conf. on Industrial, Engineering & Other Applications of Applied Intelligent Systems, IEA-AIE-2010, Springer-Verlag, LNAI(6096), pp. 550-559, 2010.
Rosas M., Errecalde M., Rosso P. Un Análisis Comparativo de Estrategias para la Categorización Semántica de Textos Cortos. In: Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN), num. 44, pp. 11-18, 2010.
Ingaramo D., Rosas M.V., Errecalde M., Rosso P. Clustering Iterativo de Textos Cortos con Representaciones basadas en Conceptos. In: Proc. Workshop on Natural Language Processing and Web-based Technologies, 12th edition of the Ibero-American Conference on Artificial Intelligence, IBERAMIA-2010, Bahía Blanca, Argentina, November 1-5, pp. 80-89, 2010.

Red Temática en Tratamiento de la Información Multilingüe y Multimodal (TIMM)

