Corpora, Databases and Other Linguistic Resources


OCA Corpus

OCA is an Arabic corpus of movie reviews. The corpus was built from Arabic-language reviews collected from various web pages that are shown...

Opinion analysis corpus

The corpus contains 3,000 opinions in the tourism domain. These opinions were obtained from the TripAdvisor blog.

SENSEM Corpus

This corpus consists of Spanish journalistic texts; more precisely, it is a collection of news articles extracted from El Periódico de Catalunya. It has been manually annotated at a...

SENSEM Verbal DB

The lexical database contains the 250 most frequent Spanish verbs, with a total of 1,000 senses. These senses are described from a syntactic and semantic perspective: semantic roles,...

SINAI SA Corpus

This corpus was prepared by the SINAI group in December 2008. SINAI SA (Sentiment Analysis) was created by crawling the Amazon website. Nearly 2,000...

Single-label hep-ex Clustering Corpus

This corpus is a pre-processed version of hep-ex [1], a collection of scientific abstracts compiled by the University of Jaén, Spain.

Social-ODP-2k9

Social-ODP-2k9 is a dataset created during December 2008 and January 2009 with data retrieved from the social bookmarking sites Delicious and StumbleUpon, the Open Directory...

SoCo corpus

This corpus belongs to the SOCO international competition on source code reuse detection, held at the international forum FIRE2014. It consists of...

Spanish QC

This resource consists of 6,305 questions in Spanish labeled for Question Answering classification, following the taxonomy defined in the paper “X. Li and D. Roth....

Spanish WordNet 3.0

An open-source lexical and semantic resource for Spanish that has been created from the latest version of the English WordNet (3.0) and connected with it through the ID and the...

Taxonomy-Based Opinion Dataset

This dataset contains annotated reviews for three different domains: cars, headphones and hotels. Opinions are annotated at the feature level, with the following...

The Arabic Wikipedia XML corpus

The 30 most frequent categories of the Arabic Wikipedia XML corpus gathered by Ludovic Denoyer and Patrick Gallinari were selected in order to provide a testbed for the...

The KnCr clustering corpus

This is a new narrow-domain short-text corpus in the medical domain, constructed by downloading the last sample of documents provided in MEDLINE and selecting only...

Twitter Hash tags Corpus

Corpus containing 50,000 texts extracted from Twitter. Each text contains a hashtag that depends on its topic: #humor, #irony, #politics, #technology, #education.

Volem

VOLEM (Verbs: Multilingual Lexical Organization) is a multilingual lexical database of a subset of Spanish, Catalan and French verbs. In this multilingual resource,...

Wiki10+

Wiki10+ is a dataset created during April 2009 with data retrieved from the social bookmarking site Delicious and Wikipedia. It is made up of 20,764 articles of the English...