Authors: | Diego Ingaramo, Marcelo Errecalde (Universidad Nacional de San Luis (Argentina)), Paolo Rosso |
URL: | https://sites.google.com/site/merrecalde/resources |
Contact: | Marcelo Errecalde <merrecaunsl.edu.ar>, Paolo Rosso <prossodsic.upv.es> |
Description
Corpus R8+. A subset of the documents in the R8-Test corpus, itself a sub-collection of the well-known Reuters-21578 dataset. R8+ has the same number of groups as R8-Test (eight), but the number of documents per group differs: each group of R8+ contains only the largest documents of the corresponding R8-Test group (20% of the documents of each original group). Features of R8+: number of groups = 8, number of documents = 445, number of terms = 66314, vocabulary size = 7797, average number of terms per document = 149.02.
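As a rough illustration of how the features listed above could be recomputed, the following sketch derives them from a list of labelled documents. The function name `corpus_stats`, the `(label, text)` input layout, and the lowercase whitespace tokenization are all assumptions of this sketch; the tokenization actually used to obtain the reported figures is not described here, so the counts may differ.

```python
def corpus_stats(docs):
    """Compute basic corpus features from (group_label, text) pairs.

    Tokenization is plain lowercase whitespace splitting, which is an
    assumption of this sketch and not necessarily the original procedure.
    """
    tokens_per_doc = [text.lower().split() for _, text in docs]
    n_terms = sum(len(tokens) for tokens in tokens_per_doc)
    vocabulary = {token for tokens in tokens_per_doc for token in tokens}
    return {
        "groups": len({label for label, _ in docs}),
        "documents": len(docs),
        "terms": n_terms,
        "vocabulary": len(vocabulary),
        # Average number of terms per document, rounded to two decimals.
        "avg_terms_per_doc": round(n_terms / len(docs), 2) if docs else 0.0,
    }
```

The reported figures are internally consistent under this definition of the average: 66314 terms over 445 documents gives approximately 149.02 terms per document.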
Functionality
This corpus is intended for supervised or unsupervised categorization tasks, mainly those that involve short texts. R8+, however, contains only the largest documents of R8-Test, so that the difficulty of such a collection can be compared with that of collections of “very short” documents such as R8-, a collection similar to R8+ but generated from the shortest documents of R8-Test. Both collections were generated at the same time in previous studies to provide the “shortest length” and “largest length” versions of R8-Test.
Technology
Building this corpus required no special development tools beyond very simple routines that select the largest 20% of the documents in each R8-Test group.
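A selection routine of the kind described above might look like the following minimal sketch. The function name `select_longest`, the `(label, text)` input layout, the use of whitespace-separated tokens as the length measure, and rounding the group quota upward are all assumptions; the original routines are not published in this entry.

```python
import math
from collections import defaultdict


def select_longest(docs, fraction=0.2):
    """Keep the longest `fraction` of the documents in each group.

    `docs` is an iterable of (group_label, text) pairs. Document length is
    measured in whitespace-separated tokens (an assumption of this sketch).
    """
    by_group = defaultdict(list)
    for label, text in docs:
        by_group[label].append(text)

    selected = {}
    for label, texts in by_group.items():
        # Rounding up is an assumption; the source does not state the rule.
        keep = max(1, math.ceil(len(texts) * fraction))
        # Sort longest first and keep the top `keep` documents.
        ranked = sorted(texts, key=lambda t: len(t.split()), reverse=True)
        selected[label] = ranked[:keep]
    return selected
```

Reversing the sort order (shortest first) would give the analogous selection for a "shortest documents" collection such as R8-.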
Technical requirements
No special hardware/software is required. Disk space required: 440 Kbytes.
Modules
Innovation
Unlike R8-Test, which contains relatively short documents, this corpus makes it possible to focus on the particularities of working with the largest documents of that collection. Such a study is usually complemented with results obtained on R8-, which was generated from the shortest documents of R8-Test.
Development
- MICINN research project TEXT-ENTERPRISE 2.0 TIN2009-13391-C04-03 (Plan I+D+i).
- This corpus was generated as part of the Ph.D. work of Diego Ingaramo under the supervision of Marcelo Errecalde (external researcher of TEXT-ENTERPRISE 2.0) and Paolo Rosso.
Publications
- Ingaramo D., Errecalde M., Rosso P. A general bio-inspired method to improve the short-text clustering task. In: Proc. 10th Int. Conf. on Comput. Linguistics and Intelligent Text Processing, CICLing-2010, Springer-Verlag, LNCS(6008), pp. 661-672, 2010.
- Errecalde M., Ingaramo D., Rosso P. ITSA*: An Effective Iterative Method for Short-Text Clustering Tasks. In: Proc. 23rd Int. Conf. on Industrial, Engineering & Other Applications of Applied Intelligent Systems, IEA-AIE-2010, Springer-Verlag, LNAI(6096), pp. 550-559, 2010.
- Rosas M., Errecalde M., Rosso P. Un Análisis Comparativo de Estrategias para la Categorización Semántica de Textos Cortos. In: Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN), num. 44, pp. 11-18, 2010.
- Ingaramo D., Rosas M.V., Errecalde M., Rosso P. Clustering Iterativo de Textos Cortos con Representaciones basadas en Conceptos. In: Proc. Workshop on Natural Language Processing and Web-based Technologies, 12th edition of the Ibero-American Conference on Artificial Intelligence, IBERAMIA-2010, Bahía Blanca, Argentina, November 1-5, pp. 80-89, 2010.