Authors: | David Pinto |
URL: | http://www.dsic.upv.es/grupos/nle/downloads.html |
Contact: | David Eduardo Pinto Avendaño <dpintocs.buap.mx> |
Description
This is a new narrow-domain short-text corpus in the medical domain. It was constructed by downloading the latest sample of documents provided by MEDLINE and keeping only those related to the “Cancer” domain.
Functionality
The corpus is intended to support experiments with supervised and unsupervised classifiers on narrow-domain short texts, specifically in the medical field, using documents related to the “cancer” topic.
Technology
The corpus (raw text) and the gold standard are provided.
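As an illustration of how a gold standard is typically used in clustering experiments, the following minimal sketch computes cluster purity for a set of predicted clusters. It is only a sketch under assumptions: the document identifiers, category names, and the `purity` helper are hypothetical, and the actual file layout of the corpus and gold standard is not described here, so the data is given in memory as plain dictionaries.

```python
from collections import Counter, defaultdict


def purity(predicted, gold):
    """Fraction of documents that belong to the majority gold category
    of the cluster they were assigned to.

    predicted: dict mapping document id -> cluster label
    gold:      dict mapping document id -> gold-standard category
    Only documents present in both mappings are scored.
    """
    shared = predicted.keys() & gold.keys()
    if not shared:
        raise ValueError("no documents in common between prediction and gold standard")

    # Count gold categories inside each predicted cluster.
    by_cluster = defaultdict(Counter)
    for doc_id in shared:
        by_cluster[predicted[doc_id]][gold[doc_id]] += 1

    # Each cluster contributes the size of its largest gold category.
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(shared)


# Hypothetical toy data (not taken from the corpus).
gold = {"d1": "breast", "d2": "breast", "d3": "lung", "d4": "lung"}
predicted = {"d1": 0, "d2": 0, "d3": 0, "d4": 1}

score = purity(predicted, gold)
print(f"purity = {score:.2f}")
```

Purity is only one of several possible measures; it is used here because it is short to state, not because the corpus prescribes it.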
Technical requirements
No special requirements are needed to use the corpus.
Modules
Innovation
To our knowledge, no other cancer-domain corpus has been constructed for use in the categorization task.
Development
Developed as part of David Pinto's Ph.D. research and the MiDES research project (CICYT TIN2006-15265-C06-04).
Publications
- David Pinto, Paolo Rosso: KnCr: A Short-Text Narrow-Domain Sub-Corpus of Medline. TLH 2006. Advances in Computer Science: 266-269, 2006.
- David Pinto, Alfons Juan, Paolo Rosso: A Comparative Study of Clustering Algorithms on Narrow-Domain Abstracts. Procesamiento del Lenguaje Natural 37(1): 43-49, 2006.
- David Pinto, José-Miguel Benedí, Paolo Rosso: Clustering Narrow-Domain Short Texts by Using the Kullback-Leibler Distance. CICLing 2007. Lecture Notes in Computer Science 4394, Springer-Verlag: 611-622, 2007.
- David Pinto: On Clustering of Narrow Domain Short-Text Corpora. PhD Thesis, Universidad Politécnica de Valencia, Spain, July 2008.