| CL!NSS PAN@FIRE corpus | The corpus contains source (Hindi) and target (English) news story partitions. The documents are marked up with news story metadata such as title, date of publication and... |
| CL!TR corpus | The corpus contains a set of potential source documents D, written in English, and a set of suspicious documents S, written in Hindi. In the corpus you will find plain text files... |
| CLPD | This corpus was used in the cross-lingual plagiarism detection experiments reported in this publication: Barrón-Cedeño A., Gupta P., Rosso P. Methods for... |
| Co-derivatives corpus | This corpus has been generated for the analysis of co-derivatives, text reuse and plagiarism (of course, simulated). It is composed of more than 20,000 documents from Wikipedia... |
| Colección HEP | This corpus is intended for the study of multi-label text classifiers. It consists of scientific articles in the field of High Energy Physics... |
| Computer Science Trilingual Corpus | This corpus was developed as part of a project for teaching innovation whose objective was the improvement of the processes for teaching/learning technical English by using a... |
| Corpus EasyAbstracts | Corpus EasyAbstracts. This collection can be considered harder than collections of long documents such as Micro4News because its documents are scientific abstracts (same... |
| Corpus Micro4News | Corpus Micro4News. The Micro4News collection was constructed with medium-length documents that correspond to four very different topics of the popular 20Newsgroups corpus: 1)... |
| Corpus Plagiarism Competition PAN-PC-2010 | This corpus contains documents in which artificial plagiarism has been inserted automatically: 8.4 GB, 162,000 cases of plagiarism |
| Corpus R8+ | Corpus R8+. Subset of documents of the R8-Test corpus, a sub-collection of the well-known Reuters-21578 dataset. R8+ has the same number of groups as R8-Test (eight groups),... |
| Corpus R8- | Corpus R8-. Subset of documents of the R8-Test corpus, a sub-collection of the well-known Reuters-21578 dataset. R8- has the same number of groups as R8-Test (eight groups),... |
| Corpus R8B | Corpus R8B. Subset of documents of the R8-Test corpus, a sub-collection of the well-known Reuters-21578 dataset. R8B has the same number of groups as R8-Test (eight groups),... |
| Cross-Lingual Plagiarism Corpus | The CliPA corpus has been created as a resource for the design and test of methods for the automatic detection of cross-lingual plagiarism cases. It contains a set of original... |
| DDI corpus | DDI corpus: An annotated corpus with pharmacological substances and drug-drug interactions. The management of drug-drug interactions (DDIs) is a critical issue resulting... |
| DeliciousT140 | DeliciousT140: Collection of 144,574 web documents in English, with their corresponding tag information extracted from Delicious in June 2008, from the feeds... |
| Diccionario de colocaciones del Español (DiCE) | The Diccionario de Colocaciones del Español is a dictionary that provides information on the restricted co-occurrence of Spanish words, in a manner similar to... |
| DrugNer | DrugNer Corpus: a corpus annotated with generic drug names and other biomedical concepts by ISABEL SEGURA-BEDMAR is licensed under a Creative Commons Reconocimiento-No... |
| DrugNerAr corpus | There is no corpus dedicated to the resolution of the anaphoric expressions occurring in drug interaction descriptions in pharmacological documents. A collection of 49... |
| EDBL lexical database | EDBL (Euskararen Datu-Base Lexikala) is a general-purpose lexical database used in Basque text-processing tasks. It is a large repository of lexical knowledge (currently around... |
| EmIroGeFB | Corpus of Facebook comments in Spanish covering 3 domains (politics, football, celebrities), labeled with the 6 basic emotions joy, surprise, fear, anger,... |
