Corpora, Databases and Other Linguistic Resources

CL!NSS PAN@FIRE corpus	The corpus contains a source (Hindi) partition and a target (English) partition of news stories. The documents are marked up with news story metadata such as title, date of publication and...
CL!TR corpus	The corpus contains a set of potential source documents D, written in English, and a set of suspicious documents S, written in Hindi. In the corpus you will find plain text files...
CLPD	This corpus was used in the cross-lingual plagiarism detection experiments reported in this publication: Barrón-Cedeño A., Gupta P., Rosso P. Methods for...
Co-derivatives corpus	This corpus has been generated for the analysis of co-derivatives, text reuse and plagiarism (of course, simulated). It is composed of more than 20,000 documents from Wikipedia...
Colección HEP	This corpus is intended for the study of multi-label text classifiers. It consists of scientific articles in the field of High Energy Physics...
Computer Science Trilingual Corpus	This corpus was developed as part of a project for teaching innovation whose objective was the improvement of the processes for teaching/learning technical English by using a...
Corpus EasyAbstracts	Corpus EasyAbstracts. This collection can be considered harder than collections of long documents such as Micro4News because its documents are scientific abstracts (same...
Corpus Micro4News	Corpus Micro4News. The Micro4News collection was constructed with medium-length documents that correspond to four very different topics of the popular 20Newsgroups corpus: 1)...
Corpus Plagiarism Competition PAN-PC-2010	This corpus contains documents in which artificial plagiarism has been inserted automatically: 8.4 GB, 162,000 cases of plagiarism
Corpus R8+	Corpus R8+. Subset of documents of the R8-Test corpus, a sub-collection of the well-known Reuters-21578 dataset. R8+ has the same number of groups as R8-Test (eight groups),...
Corpus R8-	Corpus R8-. Subset of documents of the R8-Test corpus, a sub-collection of the well-known Reuters-21578 dataset. R8- has the same number of groups as R8-Test (eight groups),...
Corpus R8B	Corpus R8B. Subset of documents of the R8-Test corpus, a sub-collection of the well-known Reuters-21578 dataset. R8B has the same number of groups as R8-Test (eight groups),...
Cross-Lingual Plagiarism Corpus	The CliPA corpus has been created as a resource for the design and test of methods for the automatic detection of cross-lingual plagiarism cases. It contains a set of original...
DDI corpus	DDI corpus: An annotated corpus with pharmacological substances and drug-drug interactions. The management of drug-drug interactions (DDIs) is a critical issue resulting...
DeliciousT140	DeliciousT140: A collection of 144,574 English-language web documents, with their corresponding tag information extracted from Delicious in June 2008, from the feeds...
Diccionario de colocaciones del Español (DiCE)	The Diccionario de Colocaciones del Español is a dictionary that provides information on the restricted co-occurrence of Spanish words, in a manner similar to...
DrugNer	DrugNer Corpus: a corpus annotated with generic drug names and other biomedical concepts by ISABEL SEGURA-BEDMAR is licensed under a Creative Commons Reconocimiento-No...
DrugNerAr corpus	There is no corpus dedicated to the resolution of the anaphoric expressions occurring in drug interaction descriptions in pharmacological documents. A collection of 49...
EDBL lexical database	EDBL (Euskararen Datu-Base Lexikala) is a general-purpose lexical database used in Basque text-processing tasks. It is a large repository of lexical knowledge (currently around...
EmIroGeFB	Corpus of Spanish-language Facebook comments on three domains (politics, football, celebrities), labeled with the six basic emotions joy, surprise, fear, anger,...

Red Temática en Tratamiento de la Información Multilingüe y Multimodal (TIMM)

