Corpora, Databases and Other Linguistic Resources

ADQA (Arabic Definition Question Answering) corpus	The ADQA (Arabic Definition Question Answering) corpus consists of a list of 50 definition questions (ArabicListDefQuest), a set of 50 files...
Amazon Data Sets	This corpus was created to study figurative language, especially irony, sarcasm and humour, in the context of sentiment analysis. It contains approx....
AnCora-Ca	AnCora-Ca is a multilevel annotated corpus of Catalan, consisting of 500,000 words mostly from newspaper articles. AnCora-Ca is annotated with morphological (PoS), syntactic...
AnCora-CO-Ca	AnCora-CO-Ca is a subset of the multilevel annotated corpus AnCora-Ca (for Catalan), consisting of 400,000 words, enriched with coreference information, where...
AnCora-CO-Es	AnCora-CO-Es is a subset of the multilevel annotated corpus AnCora-Es (for Spanish), consisting of 400,000 words, enriched with coreference information, where...
AnCora-DEP-Ca	AnCora-DEP-Ca is the AnCora-Ca multilevel annotated corpus of Catalan in dependency-based representation, consisting of 500,000 words approximately.
AnCora-DEP-Es	AnCora-DEP-Es is the AnCora-Es multilevel annotated corpus of Spanish in dependency-based representation, consisting of 500,000 words approximately.
AnCora-Es	AnCora-Es is a multilevel annotated corpus of Spanish, consisting of 500,000 words mostly from newspaper articles. AnCora-Es is annotated with morphological (PoS), syntactic...
AnCora-Verb-Ca	AnCora-Verb-Ca is a verbal lexicon containing 2,141 different verbs. In the AnCora-Verb-Ca lexicon, the mapping between syntactic functions, arguments and thematic roles of each...
AnCora-Verb-Es	AnCora-Verb-Es is a verbal lexicon containing 2,603 different verbs. In the AnCora-Verb-Es lexicon, the mapping between syntactic functions, arguments and thematic roles of each...
ANERcorp	ANERcorp is an Arabic NER corpus which consists of 150,000 tokens (which go up to 200,000 tokens after segmentation).
ANERgazet	ANERgazet is a set of 3 Arabic gazetteers (people, locations and organizations) intended mainly for the Arabic NER task, although it can also be used for other Arabic NLP...
Arabic QA	This corpus includes Spanish journalistic texts, more precisely, it is a collection of news extracted from El Periódico de Catalunya. It has been manually annotated at a...
Arabic WordNet	The Arabic WordNet (AWN) is a lexical database of the Arabic language following the development process of Princeton English WordNet and EuroWordNet. It utilizes the Suggested...
Author Profiling @ PAN-2013	This corpus consists of documents written in both English and Spanish. With regard to age, we will consider posts of three classes: 10s (13-17), 20s (23-27), and 30s (33-47)....
Author Profiling @ PAN-2014	Twitter tweets and social media texts written in both English and Spanish as well as hotel reviews written in English. With regard to age, we will consider the following...
Blogs Analysis corpus	The corpus is made up of 8 sets. Every set contains 2,400 documents automatically retrieved from LiveJournal and Wikipedia. The corpus is organised as follows: i) The [mfs]...
Blogs Clustering Corpus	This is a set of corpora made up of discussion lines extracted from two blog websites: boing-boing and slashdot.
CESCA	CESCA is a Catalan corpus consisting of school writing produced by 2,400 schoolchildren between the ages of five and sixteen. Each informant has written different types of...
CICLing-2002 Clustering Corpus	This is a pre-processed version of 48 scientific abstracts from the CICLing 2002 conference (computational linguistics).