Authors: | Diego Ingaramo, Marcelo Errecalde (Universidad Nacional de San Luis, Argentina), Paolo Rosso |
URL: | https://sites.google.com/site/merrecalde/resources |
Contact: | Marcelo Errecalde <merreca@unsl.edu.ar>, Paolo Rosso <prosso@dsic.upv.es> |
Description
Micro4News corpus. The Micro4News collection was built from medium-length documents belonging to four very different topics of the popular 20Newsgroups corpus: 1) sci.med, 2) soc.religion.christian, 3) rec.autos and 4) comp.os.ms-windows.misc. For each topic, the longest documents in the corresponding group were selected. As a result, the selected documents are, on average, at least seven times as long as the abstracts in corpora such as EasyAbstracts and CICLing-2002, which are often used alongside Micro4News in comparative studies. Features of Micro4News: number of groups = 4, number of documents = 48, number of terms = 125614, vocabulary size = 12785, average number of terms per document = 2616.95.
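The size figures listed above (number of terms, vocabulary size, average terms per document) can be recomputed from the raw documents. The following is a minimal sketch, not the authors' procedure: the function name corpus_stats and the tokenization rule (lowercased runs of letters or digits) are assumptions, so the counts it produces on the real corpus may differ from the reported ones.

```python
import re
from collections import Counter


def corpus_stats(documents):
    """Return basic size statistics for a list of document strings."""
    # Assumption: a "term" is a lowercased run of ASCII letters or digits.
    token_pattern = re.compile(r"[a-z0-9]+")
    vocabulary = Counter()
    total_terms = 0
    for text in documents:
        tokens = token_pattern.findall(text.lower())
        vocabulary.update(tokens)
        total_terms += len(tokens)
    n_docs = len(documents)
    return {
        "documents": n_docs,
        "terms": total_terms,
        "vocabulary": len(vocabulary),
        "avg_terms_per_doc": total_terms / n_docs if n_docs else 0.0,
    }


# Toy input for illustration only; these sentences are not from Micro4News.
toy_documents = [
    "The car engine stalled.",
    "The engine needs a new car battery.",
]
stats = corpus_stats(toy_documents)
print(stats)
```

As a consistency check on the reported figures, 125614 terms over 48 documents gives about 2616.96 terms per document, which agrees with the listed average up to rounding.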
Functionality
This corpus is intended for supervised or unsupervised categorization studies whose main focus is short texts. The idea was to provide a small, low-complexity collection with well-differentiated categories and relatively long documents. It has been used in comparative studies alongside small, high-complexity collections of short texts (for example, the CICLing-2002 collection of scientific abstracts).
Technology
Building this corpus did not require any special development tool. All the documents in the collection were manually selected.
Technical requirements
No special hardware/software is required. Disk space required: 706 Kbytes.
Modules
Innovation
This corpus offers a small collection that should not be difficult to cluster or categorize: clustering and categorization algorithms should have no trouble obtaining high-quality results on Micro4News. This makes Micro4News an interesting baseline collection for comparative studies that involve difficult short-text collections.
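To make "high quality results" concrete, a clustering of the collection can be scored against the four known newsgroup labels. The description does not say which quality measure is used, so the sketch below uses purity, one common external measure, purely as an illustration. The function name purity and the predicted cluster ids are hypothetical.

```python
from collections import Counter, defaultdict


def purity(gold_labels, cluster_ids):
    """Fraction of documents that belong to their cluster's majority class."""
    if len(gold_labels) != len(cluster_ids):
        raise ValueError("gold_labels and cluster_ids must have equal length")
    if not gold_labels:
        return 0.0
    # For each cluster, count how many documents carry each gold label.
    members = defaultdict(Counter)
    for label, cluster in zip(gold_labels, cluster_ids):
        members[cluster][label] += 1
    # Sum the size of the majority class in every cluster.
    majority_total = sum(max(counts.values()) for counts in members.values())
    return majority_total / len(gold_labels)


# Toy example: gold labels reuse two of the topic names; the cluster ids
# are made up and do not come from any experiment.
gold = ["sci.med", "sci.med", "rec.autos", "rec.autos"]
predicted = [0, 0, 1, 0]
score = purity(gold, predicted)
print(score)
```

A purity of 1.0 means every cluster contains documents from a single topic. On a collection with well-separated categories, a reasonable algorithm would be expected to score close to that value.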
Development
- MICINN research project TEXT-ENTERPRISE 2.0 TIN2009-13391-C04-03 (Plan I+D+i).
- This corpus was generated as part of the Ph.D. work of Diego Ingaramo under the supervision of Marcelo Errecalde (external researcher of TEXT-ENTERPRISE 2.0) and Paolo Rosso.
Publications
- Ingaramo D., Errecalde M., Rosso P. A general bio-inspired method to improve the short-text clustering task. In: Proc. 10th Int. Conf. on Comput. Linguistics and Intelligent Text Processing, CICLing-2010, Springer-Verlag, LNCS(6008), pp. 661-672, 2010
- Errecalde M., Ingaramo D., Rosso P. ITSA*: An Effective Iterative Method for Short-Text Clustering Tasks. In: Proc. 23rd Int. Conf. on Industrial, Engineering & Other Applications of Applied Intelligent Systems, IEA-AIE-2010, Springer-Verlag, LNAI(6096), pp. 550-559, 2010
- Ingaramo D., Cagnina L., Errecalde M., Rosso P. A Particle Swarm Optimizer to cluster short-text corpora: a performance study. In: Proc. Workshop on Natural Language Processing and Web-based Technologies, 12th edition of the Ibero-American Conference on Artificial Intelligence, IBERAMIA-2010, Bahía Blanca, Argentina, November 1-5, pp. 71-79, 2010
- Ingaramo D., Rosas M.V., Errecalde M., Rosso P. Clustering Iterativo de Textos Cortos con Representaciones basadas en Conceptos. In: Proc. Workshop on Natural Language Processing and Web-based Technologies, 12th edition of the Ibero-American Conference on Artificial Intelligence, IBERAMIA-2010, Bahía Blanca, Argentina, November 1-5, pp. 80-89, 2010
- Errecalde M., Ingaramo D., Rosso P. Proximity estimation and the hardness of short-text corpora. In: 5th Workshop on Text-based Information Retrieval, TIR-2008, In: Proc. Database and Expert Systems Applications, DEXA-2008, IEEE Press, Turin, Italy, September 1-5, pp. 15-19, 2008
- Cagnina L., Errecalde M., Ingaramo D., Rosso P. A discrete particle swarm optimizer for clustering short-text corpora. In: Bioinspired Optimization Methods and their Applications, BIOMA-2008, Ljubljana, Slovenia, October 13-14, pp. 93-103, 2008