Authors: | Diego Ingaramo, Marcelo Errecalde (Universidad Nacional de San Luis, Argentina), Paolo Rosso |
URL: | https://sites.google.com/site/merrecalde/resources |
Contact: | Marcelo Errecalde <merreca@unsl.edu.ar>, Paolo Rosso <prosso@dsic.upv.es> |
Description
Corpus R8-. A subset of the documents in the R8-Test corpus, itself a sub-collection of the well-known Reuters-21578 dataset. R8- has the same number of groups as R8-Test (eight), but the number of documents per group differs: each group of R8- contains only the shortest documents of the corresponding R8-Test group (20% of the documents of each original group). Features of R8-: number of groups = 8, number of documents = 445, number of terms = 8481, vocabulary size = 1876, average number of terms per document = 19.06.
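The features listed above are simple counts, and the average follows from two of them (8481 / 445 ≈ 19.06). A minimal sketch of how such figures could be computed is given here; it assumes lowercased, whitespace-separated tokens, which may differ from the preprocessing actually used for R8-. The function name `corpus_stats` and the toy texts are hypothetical.

```python
def corpus_stats(texts):
    """Return basic size statistics for a list of raw document strings.

    Tokenization (lowercase + whitespace split) is an assumption of this
    sketch, not necessarily the preprocessing used to build R8-.
    """
    tokenized = [text.lower().split() for text in texts]
    n_docs = len(tokenized)
    n_terms = sum(len(tokens) for tokens in tokenized)
    vocabulary = {token for tokens in tokenized for token in tokens}
    return {
        "documents": n_docs,
        "terms": n_terms,
        "vocabulary": len(vocabulary),
        "avg_terms_per_doc": n_terms / n_docs if n_docs else 0.0,
    }


# Toy documents (hypothetical, not taken from the corpus).
toy_texts = ["oil prices rise", "oil output falls again", "grain exports"]
stats = corpus_stats(toy_texts)

# Consistency check on the reported figures: terms / documents.
reported_avg = round(8481 / 445, 2)
```

On the toy input, `stats` reports 3 documents, 9 terms and a vocabulary of 8 distinct tokens; `reported_avg` reproduces the 19.06 terms per document given for R8-.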
Functionality
This corpus is intended for supervised or unsupervised categorization tasks that mainly involve short texts. In particular, it has been used in studies of the difficulties that collections of short documents (like R8-) pose for clustering algorithms, compared with collections of documents of arbitrary length.
Technology
Building this corpus required no special tools beyond very simple routines that select the shortest 20% of the documents in each R8-Test group.
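A selection routine of this kind might look like the sketch below. It is not the authors' code: the function name `shortest_fraction_per_group`, the (label, text) input format, the use of whitespace-separated tokens as the length measure, and the floor rounding with a minimum of one document per group are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def shortest_fraction_per_group(docs, fraction=0.2):
    """Keep the shortest `fraction` of the documents in each group.

    `docs` is an iterable of (group_label, text) pairs. Length is measured
    in whitespace-separated tokens (an assumption of this sketch).
    """
    by_group = defaultdict(list)
    for label, text in docs:
        by_group[label].append(text)

    selected = []
    for label, texts in by_group.items():
        # Floor rounding with at least one document per group (assumption).
        keep = max(1, math.floor(len(texts) * fraction))
        # sorted() is stable, so equally long documents keep their order.
        shortest = sorted(texts, key=lambda t: len(t.split()))[:keep]
        selected.extend((label, text) for text in shortest)
    return selected


# Toy data (hypothetical, not taken from R8-Test).
toy = [
    ("earn", "a b c d e"),
    ("earn", "a b"),
    ("earn", "a b c"),
    ("earn", "a b c d"),
    ("earn", "a"),
    ("acq", "x y z"),
    ("acq", "x"),
]
subset = shortest_fraction_per_group(toy)
```

With the toy data, the five "earn" documents yield one selected document and the two "acq" documents also yield one, because of the minimum of one per group.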
Technical requirements
No special hardware/software is required. Disk space required: 44.3 Kbytes.
Modules
Innovation
Unlike R8-Test, which contains relatively short documents, this corpus makes it possible to focus on the particular issues that arise when working with extremely short documents.
Development
- MICINN research project TEXT-ENTERPRISE 2.0 TIN2009-13391-C04-03 (Plan I+D+i).
- This corpus was generated as part of the Ph.D. work of Diego Ingaramo under the supervision of Marcelo Errecalde (external researcher of TEXT-ENTERPRISE 2.0) and Paolo Rosso.
Publications
- Ingaramo D., Errecalde M., Rosso P. A general bio-inspired method to improve the short-text clustering task. In: Proc. 10th Int. Conf. on Comput. Linguistics and Intelligent Text Processing, CICLing-2010, Springer-Verlag, LNCS(6008), pp. 661-672, 2010.
- Errecalde M., Ingaramo D., Rosso P. ITSA*: An Effective Iterative Method for Short-Text Clustering Tasks. In: Proc. 23rd Int. Conf. on Industrial, Engineering & Other Applications of Applied Intelligent Systems, IEA-AIE-2010, Springer-Verlag, LNAI(6096), pp. 550-559, 2010.
- Rosas M., Errecalde M., Rosso P. Un Análisis Comparativo de Estrategias para la Categorización Semántica de Textos Cortos. In: Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN), num. 44, pp. 11-18, 2010.
- Ingaramo D., Rosas M.V., Errecalde M., Rosso P. Clustering Iterativo de Textos Cortos con Representaciones basadas en Conceptos. In: Proc. Workshop on Natural Language Processing and Web-based Technologies, 12th edition of the Ibero-American Conference on Artificial Intelligence, IBERAMIA-2010, Bahía Blanca, Argentina, November 1-5, pp. 80-89, 2010.