Authors: | Diego Ingaramo, Marcelo Errecalde (Universidad Nacional de San Luis, Argentina), Paolo Rosso |
URL: | https://sites.google.com/site/merrecalde/resources |
Contact: | Marcelo Errecalde <merreca |
Description
EasyAbstracts corpus. This collection can be considered harder than collections of long documents such as Micro4News because its documents are scientific abstracts (as in CICLing-2002) and are therefore short. It differs from CICLing-2002 in the degree of vocabulary overlap among its documents: EasyAbstracts documents also share a general theme (intelligent systems), but its groups are not as closely related as those of CICLing-2002.
EasyAbstracts was built from abstracts publicly available on the Internet that correspond to articles from four international journals in the following fields: 1) Machine Learning, 2) Heuristics in Optimization, 3) Automated Reasoning and 4) Autonomous Intelligent Agents. Abstracts for these disciplines can be selected so that two abstracts from different categories are not related at all. However, some complexity can be introduced by using abstracts of articles related to two or more EasyAbstracts categories. EasyAbstracts includes a few such documents in order to increase its complexity with respect to the Micro4News corpus. Nevertheless, most documents in this collection clearly belong to a single group, which allows us to assume that the collection is less complex than the CICLing-2002 corpus used in various works on short-text clustering.
Features of EasyAbstracts:
- Number of groups: 4
- Number of documents: 48
- Number of terms: 9261
- Vocabulary size: 2169
- Average number of terms per document: 192.93
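The features listed for the corpus (number of documents, number of terms, vocabulary size, average terms per document) and the notion of vocabulary overlap between groups can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names `corpus_features` and `vocabulary_overlap` are hypothetical, tokenization is plain lowercase whitespace splitting (the preprocessing actually used for EasyAbstracts is not described here), Jaccard similarity is just one possible way to quantify overlap, and the toy sentences are not taken from the corpus.

```python
from collections import Counter


def corpus_features(documents):
    """Return basic size statistics for a list of raw text documents."""
    # Assumption: lowercase whitespace tokenization, chosen only for this sketch.
    tokenized = [doc.lower().split() for doc in documents]
    term_counts = Counter(tok for doc in tokenized for tok in doc)
    n_docs = len(tokenized)
    n_terms = sum(term_counts.values())
    return {
        "documents": n_docs,
        "terms": n_terms,                # total term occurrences
        "vocabulary": len(term_counts),  # distinct terms
        "avg_terms_per_doc": n_terms / n_docs if n_docs else 0.0,
    }


def vocabulary_overlap(docs_a, docs_b):
    """Jaccard overlap between the vocabularies of two document groups."""
    vocab_a = {tok for doc in docs_a for tok in doc.lower().split()}
    vocab_b = {tok for doc in docs_b for tok in doc.lower().split()}
    union = vocab_a | vocab_b
    return len(vocab_a & vocab_b) / len(union) if union else 0.0


# Toy stand-ins for two groups; NOT actual EasyAbstracts text.
toy_ml = ["learning a classifier from data", "a kernel method for learning"]
toy_agents = ["agents negotiate in a market", "planning for autonomous agents"]

features = corpus_features(toy_ml + toy_agents)
overlap = vocabulary_overlap(toy_ml, toy_agents)
print(features)
print(f"vocabulary overlap between groups: {overlap:.3f}")
```

As a sanity check on the reported figures, 9261 terms over 48 documents gives 9261 / 48 = 192.9375 terms per document, which is consistent with the stated average of 192.93.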
Functionality
This corpus is intended for supervised or unsupervised categorization tasks that mainly involve short texts. The aim was to provide a collection of scientific abstracts with a lower degree of complexity than the CICLing-2002 corpus.
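Because every document belongs to one of four known groups, the output of a clustering algorithm can be scored against those groups. The sketch below uses purity as the score; this is an assumption chosen for simplicity, not necessarily the measure used in the publications listed for this corpus. The function name `purity`, the label strings, and the cluster assignments are all hypothetical.

```python
from collections import Counter, defaultdict


def purity(gold_labels, cluster_ids):
    """Fraction of documents that belong to the majority gold group of their cluster."""
    if len(gold_labels) != len(cluster_ids):
        raise ValueError("gold_labels and cluster_ids must have the same length")
    if not gold_labels:
        return 0.0
    # Count how many documents of each gold group land in each cluster.
    by_cluster = defaultdict(Counter)
    for gold, cluster in zip(gold_labels, cluster_ids):
        by_cluster[cluster][gold] += 1
    # Each cluster contributes the size of its largest gold group.
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(gold_labels)


# Hypothetical gold groups and cluster assignments for eight documents.
gold = ["ml", "ml", "heuristics", "heuristics",
        "reasoning", "reasoning", "agents", "agents"]
predicted = [0, 0, 1, 1, 2, 0, 3, 3]

score = purity(gold, predicted)
print(f"purity = {score:.3f}")
```

Purity rewards clusters dominated by a single group but does not penalize splitting a group across many clusters, so in a comparative study it would normally be reported alongside another measure.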
Technology
Building this corpus did not require any special development tool. All documents in the collection were manually selected.
Technical requirements
No special hardware or software is required. Disk space required: 62 KB.
Modules
Innovation
This corpus makes it possible to work with a small collection that should not be difficult for clustering or categorization purposes. Clustering or categorization algorithms should not have any problem obtaining high-quality results with Micro4News. In that way, Micro4News becomes an interesting alternative as a baseline collection in comparative studies that involve difficult short-text collections.
Development
- MICINN research project TEXT-ENTERPRISE 2.0 TIN2009-13391-C04-03 (Plan I+D+i).
- This corpus was generated as part of the Ph.D. work of Diego Ingaramo under the supervision of Marcelo Errecalde (external researcher of TEXT-ENTERPRISE 2.0) and Paolo Rosso.
Publications
- Ingaramo D., Errecalde M., Rosso P. A general bio-inspired method to improve the short-text clustering task. In: Proc. 10th Int. Conf. on Comput. Linguistics and Intelligent Text Processing, CICLing-2010, Springer-Verlag, LNCS(6008), pp. 661-672, 2010.
- Errecalde M., Ingaramo D., Rosso P. ITSA*: An Effective Iterative Method for Short-Text Clustering Tasks. In: Proc. 23rd Int. Conf. on Industrial, Engineering & Other Applications of Applied Intelligent Systems, IEA-AIE-2010, Springer-Verlag, LNAI(6096), pp. 550-559, 2010.
- Ingaramo D., Cagnina L., Errecalde M., Rosso P. A Particle Swarm Optimizer to cluster short-text corpora: a performance study. In: Proc. Workshop on Natural Language Processing and Web-based Technologies, 12th edition of the Ibero-American Conference on Artificial Intelligence, IBERAMIA-2010, Bahía Blanca, Argentina, November 1-5, pp. 71-79, 2010.
- Errecalde M., Ingaramo D., Rosso P. Proximity estimation and the hardness of short-text corpora. In: 5th Workshop on Text-based Information Retrieval, TIR-2008, In: Proc. Database and Expert Systems Applications, DEXA-2008, IEEE Press, Turin, Italy, September 1-5, pp. 15-19, 2008.
- Cagnina L., Errecalde M., Ingaramo D., Rosso P. A discrete particle swarm optimizer for clustering short-text corpora. In: Bioinspired Optimization Methods and their Applications, BIOMA-2008, Ljubljana, Slovenia, October 13-14, pp. 93-103, 2008.