AnCora-Ca

Authors:	M. Antònia Martí, Mariona Taulé, Lluís Màrquez and Manuel Bertran (CLiC-UB)
URL:	http://clic.ub.edu/ancora
Contact:	M. Antònia Martí <amartiub.edu>

Description

AnCora-Ca is a multilevel annotated corpus of Catalan consisting of 500,000 words, mostly from newspaper articles. It is annotated with morphological (PoS), syntactic (constituents and functions) and semantic (argument structure and thematic roles, semantic class, named entities and WordNet senses) information. The resulting layers are independent of each other, which makes the data easier to manage. Depending on the linguistic information being encoded, the annotation was performed manually, semiautomatically, or fully automatically.

Functionality

Annotated corpora are a crucial resource for acquiring or inferring linguistic knowledge about how languages are used. AnCora-Ca is therefore a useful resource for computational and linguistic analysis of language, and is especially valuable for machine learning systems. The corpus serves as a source of information for developing PoS taggers, syntactic parsers, and systems for Semantic Role Labelling, Word Sense Disambiguation, and Named Entity Recognition and Classification. It was used in the SemEval 2007 task "Multilevel Semantic Annotation of Catalan and Spanish".

Technology

The data is stored in XML format.
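As a rough illustration of how XML-encoded annotation of this kind might be read, the following Python sketch extracts word, lemma and PoS triples from a small inline sample. The element names (`article`, `sentence`, `sn`, `grup.verb`, `d`, `n`, `v`), the attribute names (`wd`, `lem`, `pos`, `func`), the tag values and the `extract_tokens` helper are all assumptions made for this example; they are not taken from this description and should be checked against the actual corpus files.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: every element name, attribute name and tag value
# below is an assumption for illustration, not a documented schema.
SAMPLE = """<article>
  <sentence>
    <sn func="suj">
      <d wd="El" lem="el" pos="da0ms0"/>
      <n wd="govern" lem="govern" pos="ncms000"/>
    </sn>
    <grup.verb>
      <v wd="aprova" lem="aprovar" pos="vmip3s0"/>
    </grup.verb>
  </sentence>
</article>"""


def extract_tokens(xml_text):
    """Return (word, lemma, pos) for every element that has a `wd` attribute."""
    root = ET.fromstring(xml_text)
    tokens = []
    # root.iter() walks the tree in document order, so tokens keep sentence order.
    for node in root.iter():
        word = node.get("wd")
        if word is not None:
            tokens.append((word, node.get("lem"), node.get("pos")))
    return tokens


for word, lemma, pos in extract_tokens(SAMPLE):
    print(word, lemma, pos)
```

Because the sketch assumes each annotation level is an attribute on the same element, a tool interested only in PoS tags can ignore the syntactic and semantic attributes entirely.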

Technical requirements

Modules

Innovation

At present, AnCora-Ca is the largest freely available Catalan corpus annotated at all the linguistic levels described above.

Development

The development of AnCora-Ca has been funded by the following projects from the Spanish Ministry of Education and Science: 3LB (FIT-150-500-2002-244), CESS-ECE (HUM2004-21127), PRAXEM (HUM2006-27378-E), and Lang2World (TIN2006-15265-C06-06). Additional funding was provided by the Catalan Secretary of Linguistic Policy.

Publications

Taulé, M., M.A. Martí, M. Recasens (2008). AnCora: Multilevel Annotated Corpora for Catalan and Spanish. Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC 2008). Marrakesh, Morocco.

Red Temática en Tratamiento de la Información Multilingüe y Multimodal (TIMM)
