Antonio Moreno-Sandoval and José María Guirao (LLI-UAM)
Antonio Moreno-Sandoval <antonio.msandovaluam.es>
GRAMPAL is a morphosyntactic tagger that combines a large lexicon with a statistically trained disambiguation process. It can be adapted to any language register, e.g. spontaneous speech, text corpora, or child language. Precision is above 95% for any register, and the tagger achieves especially good results on spontaneous speech. The service combines automatic tagging with manual revision of the annotation by expert linguists, providing a fully reliable annotation.
For any input text, GRAMPAL outputs the part-of-speech tag and lemma of every term. The system is trained for both spontaneous speech and written Spanish.
GRAMPAL is implemented in C++ on a Linux platform. The technology is a hybrid system based on a large lexicon and on statistical disambiguation.
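The hybrid idea described above (a lexicon proposes candidate analyses, and statistics choose among them) can be illustrated with a minimal sketch. This is not GRAMPAL's actual code, lexicon, tagset, or statistical model: the toy lexicon, the tag names, the bigram probabilities, and the function `tag` are all hypothetical, and a simple Viterbi search over tag bigrams stands in for whatever disambiguation method the system really uses.

```python
import math

# Hypothetical toy lexicon: word form -> candidate (lemma, PoS tag) analyses.
LEXICON = {
    "la": [("el", "DET"), ("lo", "PRON")],
    "casa": [("casa", "NOUN"), ("casar", "VERB")],
    "es": [("ser", "VERB")],
    "grande": [("grande", "ADJ")],
}

# Hypothetical tag-bigram probabilities, standing in for statistics
# that would normally be estimated from an annotated training corpus.
TRANSITIONS = {
    ("<s>", "DET"): 0.6, ("<s>", "PRON"): 0.2,
    ("DET", "NOUN"): 0.8, ("DET", "VERB"): 0.01,
    ("PRON", "VERB"): 0.6, ("PRON", "NOUN"): 0.05,
    ("NOUN", "VERB"): 0.5, ("VERB", "VERB"): 0.05,
    ("VERB", "ADJ"): 0.4,
}
FLOOR = 1e-4  # probability assigned to tag bigrams missing from the table


def tag(tokens):
    """Return one (token, lemma, tag) triple per token.

    The lexicon supplies the candidates; a Viterbi search keeps, for each
    candidate tag, the best-scoring path that ends in that tag.
    """
    paths = {"<s>": (0.0, [])}  # last tag -> (log score, analyses so far)
    for token in tokens:
        # Words missing from the lexicon get a single fallback analysis.
        candidates = LEXICON.get(token.lower(), [(token, "UNK")])
        new_paths = {}
        for lemma, pos in candidates:
            for prev, (score, path) in paths.items():
                s = score + math.log(TRANSITIONS.get((prev, pos), FLOOR))
                if pos not in new_paths or s > new_paths[pos][0]:
                    new_paths[pos] = (s, path + [(token, lemma, pos)])
        paths = new_paths
    return max(paths.values(), key=lambda p: p[0])[1]


if __name__ == "__main__":
    for token, lemma, pos in tag(["la", "casa", "es", "grande"]):
        print(token, lemma, pos)
```

In this toy setting the ambiguous form "casa" (noun, or a form of the verb "casar") is resolved from the tags around it, which is the role the statistical component plays in a hybrid tagger of this kind.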
The service is provided under an agreement between both parties. It works like a translation service: the client sends the corpus, and the annotated and verified version is returned. The service is available for both written and spoken resources. The output can be delivered in any format (e.g. XML or plain text) and with any tagset.
Automatic tagging and manual revision by expert linguists, managed with a dedicated tool.
GRAMPAL’s main innovation came from its use in tagging the C-ORAL-ROM corpus, an EU-funded project on spontaneous speech resources. GRAMPAL has been specially adapted for spoken Spanish, which involves dedicated training on spoken corpora for the disambiguation of PoS candidates.
GRAMPAL originated in a PhD dissertation in 1991 and has been developed for more than 10 years by a team of linguists and engineers, drawing on the experience gained in several funded research projects.
- Moreno, A. & Guirao, J.M. “Morpho-syntactic Tagging of the Spanish C-ORAL-ROM Corpus: Methodology, Tools and Evaluation”, in Spoken Language Corpus and Linguistic Informatics, John Benjamins, 2006.
- Guirao, J.M. & Moreno, A. “A ‘toolbox’ for tagging the Spanish C-ORAL-ROM corpus”, in Proceedings of the IV International Conference on Language Resources and Evaluation (LREC 2004), 2004.