CL!NSS PAN@FIRE corpus | The corpus contains source (Hindi) and target (English) news story partitions. The documents are marked up with news story metadata such as title, date of publication and... |
CL!TR corpus | The corpus contains a set of potential source documents D, written in English, and a set of suspicious documents S, written in Hindi. In the corpus you will find plain text files... |
CLPD | This corpus was used in the cross-lingual plagiarism detection experiments reported in this publication: Barrón-Cedeño A., Gupta P., Rosso P. Methods for... |
Co-derivatives corpus | This corpus has been generated for the analysis of co-derivatives, text reuse and plagiarism (of course, simulated). It is composed of more than 20,000 documents from Wikipedia... |
Colección HEP | This corpus is intended for the study of multi-label text classifiers. It consists of scientific articles in the field of High Energy Physics... |
Computer Science Trilingual Corpus | This corpus was developed as part of a project for teaching innovation whose objective was the improvement of the processes for teaching/learning technical English by using a... |
Corpus EasyAbstracts | This collection can be considered harder than collections of long documents such as Micro4News because its documents are scientific abstracts (same... |
Corpus Micro4News | The Micro4News collection was constructed with medium-length documents that correspond to four very different topics of the popular 20Newsgroups corpus: 1)... |
Corpus Plagiarism Competition PAN-PC-2010 | This corpus contains documents in which artificial plagiarism has been inserted automatically: 8.4 GB, 162,000 cases of plagiarism |
Corpus R8+ | Subset of documents of the R8-Test corpus, a sub-collection of the well-known Reuters-21578 dataset. R8+ has the same number of groups as R8-Test (eight groups),... |
Corpus R8- | Subset of documents of the R8-Test corpus, a sub-collection of the well-known Reuters-21578 dataset. R8- has the same number of groups as R8-Test (eight groups),... |
Corpus R8B | Subset of documents of the R8-Test corpus, a sub-collection of the well-known Reuters-21578 dataset. R8B has the same number of groups as R8-Test (eight groups),... |
Cross-Lingual Plagiarism Corpus | The CliPA corpus has been created as a resource for the design and test of methods for the automatic detection of cross-lingual plagiarism cases. It contains a set of original... |
DDI corpus | DDI corpus: An annotated corpus with pharmacological substances and drug-drug interactions. The management of drug-drug interactions (DDIs) is a critical issue resulting... |
DeliciousT140 | DeliciousT140: A collection of 144,574 English-language web documents, with their corresponding tag information extracted from Delicious in June 2008, from the feeds... |
Diccionario de colocaciones del Español (DiCE) | The Diccionario de Colocaciones del Español is a dictionary that provides information on the restricted co-occurrence of Spanish words, in a similar way to... |
DrugNer | DrugNer Corpus: a corpus annotated with generic drug names and other biomedical concepts by Isabel Segura-Bedmar is licensed under a Creative Commons Attribution-Non... |
DrugNerAr corpus | There is no corpus dedicated to the resolution of the anaphoric expressions occurring in drug interaction descriptions in pharmacological documents. A collection of 49... |
EDBL lexical database | EDBL (Euskararen Datu-Base Lexikala) is a general-purpose lexical database used in Basque text-processing tasks. It is a large repository of lexical knowledge (currently around... |
EmIroGeFB | A corpus of Spanish-language Facebook comments on three domains (politics, football, celebrities), labeled with the six basic emotions: joy, surprise, fear, anger,... |