Authors: | David Pinto |
URL: | http://nlp.cs.buap.mx/watermarker/ |
Contact: | David Eduardo Pinto Avendaño <dpinto |
Description
The Watermarking Corpora On-line System (WaCOS) provides a set of measures for assessing text corpora.
Functionality
WaCOS lets researchers in linguistics and computational linguistics study the following corpus features: domain broadness, shortness, class imbalance, stylometry, and structure. It provides a user-friendly interface for evaluating corpora.
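Two of the features listed above, shortness and class imbalance, can be illustrated with a minimal sketch. The formulas here are assumptions chosen for illustration (mean document length in tokens, and the standard deviation of class sizes); they are not taken from the WaCOS modules, and the function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def shortness(documents):
    """Mean document length in whitespace-separated tokens (assumed definition)."""
    return mean(len(doc.split()) for doc in documents)


def class_imbalance(labels):
    """Population standard deviation of the class sizes (assumed definition).

    The mean class size equals the share each class would have in a
    perfectly balanced corpus, so the result is 0.0 when all classes
    contain the same number of documents and grows as sizes diverge.
    """
    counts = list(Counter(labels).values())
    return pstdev(counts)


if __name__ == "__main__":
    docs = ["short text corpus", "narrow domain", "clustering of abstracts"]
    labels = ["nlp", "nlp", "ir"]
    print("shortness:", shortness(docs))
    print("class imbalance:", class_imbalance(labels))
```

Lower shortness values indicate shorter documents, and a class imbalance of zero indicates a perfectly balanced corpus under these assumed definitions.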
Technology
The WaCOS front-end is written in PHP. It integrates a set of modules written in several programming languages (C, C++, Java, AWK). To assess the relative hardness of a given corpus, these components use n-gram language modelling, the Zipf distribution of frequencies, density-based measures, and internal clustering validity measures, among others.
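As an illustration of how a Zipf-based measure might work, the sketch below fits a straight line to log(frequency) against log(rank) and returns its slope. This is an assumption about one common way to use the Zipf distribution, not the method implemented in the WaCOS modules; the function name `zipf_slope` is hypothetical.

```python
import math
from collections import Counter


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank).

    Terms are ranked by descending frequency. A vocabulary that follows
    Zipf's law closely gives a slope near -1.
    """
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct tokens")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    text = "the cat sat on the mat and the dog sat on the rug"
    print("Zipf slope:", zipf_slope(text.split()))
```

Comparing the fitted slope across corpora gives a rough indication of how their vocabulary distributions differ; how such a value feeds into a hardness score is left open here.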
Technical requirements
The end user needs only a web browser to access the on-line system.
Modules
Innovation
A freely available web-based tool for studying the particular features of text corpora.
Development
Developed as part of David Pinto’s Ph.D. and the MiDES CICYT TIN2006-15265-C06-04 research project.
Publications
- David Pinto: On Clustering of Narrow Domain Short-Text Corpora. PhD Thesis, Universidad Politécnica de Valencia, Spain, July 2008.
- Diego Ingaramo, David Pinto, Paolo Rosso, Marcelo Errecalde: Evaluation of Internal Validity Measures in Short-Text Corpora. CICLing 2008. Lecture Notes in Computer Science 4919, Springer-Verlag: 555-567, 2008.
- Rafael Guzman, Manuel Montes, Paolo Rosso, Luis Villaseñor-Pineda, David Pinto: Semi-supervised Approach for WSD using the Web as Corpus. CICLing 2009. Lecture Notes in Computer Science, Springer-Verlag, 2009.