Corpora, Databases and Other Linguistic Resources

OCA Corpus	OCA is an Arabic corpus of movie reviews. It was generated from Arabic-language reviews collected from various web pages that are shown...
Opinion analysis corpus	The corpus contains 3,000 opinions in the tourism domain. These opinions were obtained from the TripAdvisor blog.
SENSEM Corpus	This corpus consists of Spanish journalistic texts, specifically a collection of news items extracted from El Periódico de Catalunya. It has been manually annotated at a...
SENSEM Verbal DB	The lexical database contains the 250 most frequent Spanish verbs, with a total of 1,000 senses. These senses are described from a syntactic and semantic perspective: semantic roles,...
SINAI SA Corpus	This corpus was prepared by the SINAI group in December 2008. SINAI SA (Sentiment Analysis) was created by crawling the Amazon website. Almost 2000...
Single-label hep-ex Clustering Corpus	This corpus is a pre-processed version of hep-ex [1], a collection of scientific abstracts compiled by the University of Jaén, Spain.
Social-ODP-2k9	Social-ODP-2k9 is a dataset created during December 2008 and January 2009 with data retrieved from the social bookmarking sites Delicious and StumbleUpon, the Open Directory...
SoCo corpus	This corpus belongs to the SOCO international competition on source code reuse detection, held at the FIRE2014 international forum. It consists of...
Spanish QC	This resource consists of 6305 questions in Spanish, labeled for Question Answering classification, following the taxonomy defined in the article “X. Li and D. Roth....
Spanish WordNet 3.0	An open-source lexical and semantic resource for Spanish that has been created from the latest version of the English WordNet (3.0) and connected with it through the ID and the...
Taxonomy-Based Opinion Dataset	This dataset contains annotated reviews for three different domains: cars, headphones and hotels. Opinions are annotated at the feature level, with the following...
The Arabic Wikipedia XML corpus	The 30 most frequent categories of the Arabic Wikipedia XML corpus gathered by Ludovic Denoyer and Patrick Gallinari were selected in order to provide a testbed for the...
The KnCr clustering corpus	This is a new narrow-domain short text corpus in the medicine domain which was constructed by downloading the last sample of documents provided in MEDLINE and selecting only...
Twitter Hash tags Corpus	Corpus containing 50,000 texts extracted from Twitter. Each text contains a hashtag that depends on its topic: #humor, #irony, #politics, #technology, #education
Volem	VOLEM (Verbs: Multilingual Lexical Organization) is a multilingual lexical database of a subset of Spanish, Catalan and French verbs. In this multilingual resource,...
Wiki10+	Wiki10+ is a dataset created during April 2009 with data retrieved from the social bookmarking site Delicious and Wikipedia. It is made up of 20,764 articles of the English...