Eustagger is a robust, wide-coverage morphological analyser and Part-of-Speech tagger for Basque. The analyser is based on the two-level formalism and has been designed incrementally, with three main modules: the standard analyser, the analyser of linguistic variants, and the analyser without lexicon, which can recognize word-forms whose lemmas are not in the lexicon. Using lexical transducers for the analyser has improved both the performance of the different components of the system and the description itself. For each token, the analyser provides the possible lemmas, PoS and other morphological information. It also recognizes date/time expressions and numbers. The tagger applies a combination of stochastic and rule-based disambiguation methods to Basque. The methods used in disambiguation are the Constraint Grammar (CG) formalism and an HMM-based tagger. CG rules are applied using all the morphological features, and this process decreases the morphological ambiguity of texts. Finally, the stochastic tool is used to select just one of the remaining possible tags. Using only the stochastic method, the error rate is about 14%, but the accuracy may be increased by about 2% by enriching the lexicon with the unknown words. When both methods are combined, the error rate of the whole process is 3.5%.
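The incremental design of the analyser described above (standard analyser, then analyser of variants, then analyser without lexicon) can be sketched as a fallback cascade. This is a minimal illustration in Python, not Eustagger's implementation, which is built on two-level rules and lexical transducers. The lexicon, suffix table, variant map, tag names and all function names are toy assumptions, and the sketch assumes that a later stage is consulted only when the earlier ones return no reading.

```python
# Toy data: illustrative assumptions only, not taken from Eustagger's lexicon.
LEXICON = {"etxe": "NOUN", "mendi": "NOUN"}
SUFFIXES = {"": "", "a": "DET.SG", "an": "INE.SG"}  # hypothetical tag names
VARIANTS = {"etse": "etxe"}  # assumed variant spelling -> standard stem


def split_candidates(word):
    """Yield every (stem, suffix_tag) split whose suffix is in SUFFIXES."""
    for suffix, tag in SUFFIXES.items():
        if word.endswith(suffix) and len(word) > len(suffix):
            yield word[: len(word) - len(suffix)], tag


def standard(word):
    """Stage 1: the stem must be in the lexicon as written."""
    return [(stem, LEXICON[stem], tag)
            for stem, tag in split_candidates(word) if stem in LEXICON]


def variants(word):
    """Stage 2: map a known variant stem to its standard form first."""
    readings = []
    for stem, tag in split_candidates(word):
        std = VARIANTS.get(stem)
        if std in LEXICON:
            readings.append((std, LEXICON[std], tag))
    return readings


def guesser(word):
    """Stage 3: no lexicon; propose every stem left after a known suffix."""
    return [(stem, "UNKNOWN", tag) for stem, tag in split_candidates(word)]


def analyse(word):
    """Return (stage_name, readings) from the first stage that succeeds."""
    for name, stage in (("standard", standard),
                        ("variant", variants),
                        ("guesser", guesser)):
        readings = stage(word)
        if readings:
            return name, readings
    return "none", []
```

Note that the last stage typically returns several competing readings for an unknown word, which is one reason a disambiguation step is needed afterwards.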
Tokenization, morphological analysis, lemmatization and tagging for Basque. There is a web service.
Implemented in C++, using FSM technology from Xerox and the CG library from Connexor.
Four main modules: tokenizer, morphological analyzer, rule-based disambiguation, and HMM-based disambiguation.
Eustagger is the analyzer/tagger for Basque.
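The last two modules work in sequence: rules first remove readings that are impossible in context, and a stochastic model then picks one tag among those that remain. The following is a minimal Python sketch of that combination under stated assumptions. It does not use the Connexor CG library or Eustagger's actual HMM; the rule, the tag set, the token names and the probabilities are all hypothetical, and the functions `apply_constraints` and `viterbi` are named here only for illustration.

```python
import math


def no_verb_after_det(sentence, i):
    """Hypothetical rule: discard a VERB reading right after an unambiguous DET."""
    if i > 0 and sentence[i - 1][1] == {"DET"}:
        return {"VERB"}
    return set()


def apply_constraints(sentence, rules):
    """Rule-based step: remove banned tags, but never a token's last reading."""
    out = [(tok, set(tags)) for tok, tags in sentence]
    for i in range(len(out)):
        for rule in rules:
            tok, tags = out[i]
            remaining = tags - rule(out, i)
            if remaining:
                out[i] = (tok, remaining)
    return out


def viterbi(sentence, trans, emit, floor=1e-6):
    """Stochastic step: best tag sequence over the readings that survived.

    trans maps (previous_tag, tag) and emit maps (tag, token) to probabilities;
    missing entries fall back to `floor`.
    """
    best = {"<s>": (0.0, [])}  # tag -> (log probability, best path so far)
    for tok, tags in sentence:
        nxt = {}
        for tag in sorted(tags):
            e = math.log(emit.get((tag, tok), floor))
            score, path = max(
                ((p + math.log(trans.get((prev, tag), floor)) + e, pth)
                 for prev, (p, pth) in best.items()),
                key=lambda item: item[0],
            )
            nxt[tag] = (score, path + [tag])
        best = nxt
    return max(best.values(), key=lambda item: item[0])[1]


# Toy input: each token carries the set of tags left by morphological analysis.
sentence = [("w1", {"DET"}), ("w2", {"NOUN", "VERB"}), ("w3", {"NOUN", "ADJ"})]
trans = {("<s>", "DET"): 0.9, ("DET", "NOUN"): 0.8,
         ("NOUN", "ADJ"): 0.6, ("NOUN", "NOUN"): 0.2}

reduced = apply_constraints(sentence, [no_verb_after_det])
tags = viterbi(reduced, trans, emit={})
```

In this toy run the rule resolves `w2` on its own, and the HMM only has to choose between the two readings left for `w3`. Running the rules first shrinks the search space of the stochastic step, which is the intuition behind combining the two methods.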
Different projects funded by the Basque government and the Spanish R&D agency.
- Alegria I., Artola X., Sarasola K., Urkia M. 1996. Automatic morphological analysis of Basque. Literary & Linguistic Computing, Vol. 11, No. 4, 193-203. Oxford University Press. Oxford.
- Alegria I., Aranzabe M., Ezeiza A., Ezeiza N., Urizar R. 2002. Robustness and customisation in an analyser/lemmatiser for Basque. LREC-2002 Customizing knowledge in NLP applications Workshop.
- Aduriz I., Alegria I., Arriola J.M., Urizar R. 1998. Combining Stochastic and Rule-Based Methods for Disambiguation in Agglutinative Languages. COLING-ACL’98, Montreal (Canada). August 10-14, 1998.