M. Antònia Martí, Mariona Taulé, Lluís Màrquez and Manuel Bertran (CLiC-UB)
M. Antònia Martí <amartiub.edu>
AnCora-Ca is a multilevel annotated corpus of Catalan consisting of 500,000 words, mostly from newspaper articles. It is annotated with morphological (PoS), syntactic (constituents and functions), and semantic (argument structure and thematic roles, semantic class, named entities, and WordNet senses) information. The annotation layers are independent of each other, which makes the data easier to manage. Depending on the linguistic information encoded, the annotation was performed manually, semiautomatically, or fully automatically.
Annotated corpora are a crucial resource for acquiring or inferring linguistic knowledge about how languages are used. AnCora-Ca is therefore a useful resource for computational and linguistic analysis of language, particularly for machine learning systems. The corpus serves as a source of information for developing PoS taggers, syntactic parsers, and systems for Semantic Role Labelling, Word Sense Disambiguation, and Named Entity Recognition and Classification. It was used in the SemEval 2007 task "Multilevel Semantic Annotation of Catalan and Spanish".
Data stored in XML format
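Because the annotation layers are independent and the data are stored in XML, each layer can in principle be read on its own. The following is a minimal sketch in Python of that idea, not the corpus's documented schema: the element names (`sentence`, `phrase`, `token`) and attribute names (`wd`, `lem`, `pos`, `type`, `func`) are hypothetical and chosen only for illustration, so they should be checked against the actual files before use.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element and attribute names are illustrative
# assumptions, NOT the documented AnCora-Ca schema.
SAMPLE = """<sentence>
  <phrase type="NP" func="subject">
    <token wd="La" lem="el" pos="D"/>
    <token wd="casa" lem="casa" pos="N"/>
  </phrase>
  <phrase type="VP">
    <token wd="dorm" lem="dormir" pos="V"/>
  </phrase>
</sentence>"""


def read_tokens(xml_text):
    """Return (word, lemma, PoS) triples: the morphological layer only."""
    root = ET.fromstring(xml_text)
    return [(t.get("wd"), t.get("lem"), t.get("pos"))
            for t in root.iter("token")]


def read_functions(xml_text):
    """Return (constituent type, function) pairs: the syntactic layer only."""
    root = ET.fromstring(xml_text)
    return [(p.get("type"), p.get("func")) for p in root.iter("phrase")]


if __name__ == "__main__":
    for word, lemma, pos in read_tokens(SAMPLE):
        print(word, lemma, pos)
    for ctype, func in read_functions(SAMPLE):
        print(ctype, func)
```

Each function touches only the attributes of one layer, so a consumer interested in PoS tags, for example, does not need to know how syntactic functions are encoded.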
At present, AnCora-Es is the largest freely available Spanish corpus annotated at all the linguistic levels described above.
The development of AnCora-Ca has been funded by the following projects from the Spanish Ministry of Education and Science: 3LB (FIT-150-500-2002-244), CESS-ECE (HUM2004-21127), PRAXEM (HUM2006-27378-E), and Lang2World (TIN2006-15265-C06-06). Additional funding was provided by the Catalan Secretary of Linguistic Policy.
Taulé, M., M.A. Martí, M. Recasens (2008). AnCora: Multilevel Annotated Corpora for Catalan and Spanish. Proceedings of the 6th International Conference on Language Resources and Evaluation. Marrakesh (Morocco).