Daniel Pérez (M.Sc. student) and David Pinto
David Eduardo Pinto Avendaño <dpintocs.buap.mx>
This is a set of corpora made up of discussion lines extracted from two blog websites: boing-boing and slashdot.
The aim of this corpus is to support experiments with supervised and unsupervised classifiers on narrow-domain short texts, specifically in the medicine field, with documents related to the “cancer” topic.
The corpus (raw-text blogs) and the gold standard are provided. The discussion lines are intended as categories or classes of the gold standard, whereas posts are the target documents.
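As an illustration of how a gold standard of this kind might be used, the sketch below scores a clustering of posts against discussion-line labels using purity. The post identifiers, discussion-line names, dictionary layout, and the choice of purity as the measure are all assumptions made for the example; they do not describe the actual file format of the corpus or the evaluation used by the authors.

```python
from collections import Counter, defaultdict


def purity(gold, predicted):
    """Return the purity of a clustering with respect to gold classes.

    gold:      maps post id -> discussion line (gold-standard class)
    predicted: maps post id -> cluster id assigned by a classifier
    """
    if gold.keys() != predicted.keys():
        raise ValueError("gold and predicted must cover the same posts")
    if not gold:
        raise ValueError("no posts to evaluate")

    # Collect the gold classes of the posts that fall in each cluster.
    clusters = defaultdict(list)
    for post_id, cluster_id in predicted.items():
        clusters[cluster_id].append(gold[post_id])

    # Each cluster contributes the size of its most frequent gold class.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in clusters.values()
    )
    return majority_total / len(gold)


# Hypothetical toy data: four posts drawn from two discussion lines.
gold = {
    "post-1": "line-A",
    "post-2": "line-A",
    "post-3": "line-B",
    "post-4": "line-B",
}
predicted = {"post-1": 0, "post-2": 0, "post-3": 0, "post-4": 1}

print(purity(gold, predicted))  # 0.75
```

Purity rewards clusters dominated by a single discussion line, but it also favours solutions with many small clusters, so it is usually reported alongside another measure such as the F-measure.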
No special requirements are needed in order to use the corpus.
The aim of this corpus is to manually verify the results of different classifiers on the blog clustering task.
Developed as part of David Pinto's Ph.D. work and the MiDES CICYT TIN2006-15265-C06-04 research project.