Authors: | Daniel Pérez (M.Sc. student) and David Pinto |
URL: | http://www.dsic.upv.es/grupos/nle/downloads.html |
Contact: | David Eduardo Pinto Avendaño <dpinto |
Description
This is a set of corpora made up of discussion lines extracted from two blog websites: boing-boing and slashdot.
Functionality
The aim of this corpus is to support experiments with supervised and unsupervised classifiers on narrow-domain short texts, specifically in the medical field, with documents related to the “cancer” topic.
Technology
The corpus (raw-text blogs) and the gold standard are provided. The discussion lines serve as the categories (classes) of the gold standard, whereas the posts are the target documents.
Technical requirements
No special requirements are needed to use the corpus.
Modules
Innovation
The aim of this corpus is to allow manual verification of the results of different classifiers on the blog clustering task.
Development
Developed as part of David Pinto's Ph.D. and the MiDES research project (CICYT TIN2006-15265-C06-04).