Marta Garrote and Antonio Moreno-Sandoval (LLI-UAM)
Antonio Moreno-Sandoval <antonio.msandoval@uam.es>
A systematized process for collecting both spontaneous speech and written corpora, composed of the following stages (each stage is manually reviewed by more than one person):
- Preliminary design, considering the participants, their sociolinguistic features (age, gender, demographics, linguistic origin, education, etc.) and the communicative context. The design may be modified depending on the goals of the study and the variables considered in it.
- Data collection (recording, video capture, editing, etc.).
- Orthographic transcription (both normative and real speech).
- Prosodic annotation, marking pauses, vowel lengthening, overlaps, interruptions, intonation, etc.
- Alignment of text-sound units in utterances.
- Semi-automatic morpho-syntactic annotation (part-of-speech and lemmas).
- Automatic phonological annotation.
Beyond the possible applications of the resulting data collections, this methodology allows automatic information processing and retrieval at each linguistic level, since all annotations are standardized using XML.
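As an illustration of retrieval over XML-standardized annotations, the following minimal Python sketch queries a small annotated sample at two levels (morpho-syntactic and prosodic). The element and attribute names (`utterance`, `w`, `pause`, `pos`, `lemma`, `start`, `end`) and the sample data are hypothetical, chosen only for this example; they are not the actual schema used by LLI-UAM.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated sample: the schema below is an assumption made
# for illustration, not the format actually produced by the service.
SAMPLE = """
<corpus>
  <utterance id="u1" speaker="A" start="0.00" end="1.42">
    <w pos="PRON" lemma="yo">yo</w>
    <w pos="VERB" lemma="querer">quiero</w>
    <pause/>
    <w pos="NOUN" lemma="agua">agua</w>
  </utterance>
  <utterance id="u2" speaker="B" start="1.50" end="2.10">
    <w pos="VERB" lemma="tomar">toma</w>
  </utterance>
</corpus>
"""


def words_by_pos(xml_text, pos):
    """Return (utterance id, word form, lemma) for every word tagged `pos`."""
    root = ET.fromstring(xml_text.strip())
    hits = []
    for utt in root.iter("utterance"):
        for w in utt.iter("w"):
            if w.get("pos") == pos:
                hits.append((utt.get("id"), w.text, w.get("lemma")))
    return hits


def utterances_with_pause(xml_text):
    """Return the ids of utterances that contain at least one pause mark."""
    root = ET.fromstring(xml_text.strip())
    return [u.get("id") for u in root.iter("utterance")
            if u.find("pause") is not None]


if __name__ == "__main__":
    print(words_by_pos(SAMPLE, "VERB"))
    print(utterances_with_pause(SAMPLE))
```

Because every level is stored as attributes or elements of the same XML tree, a query at one level (for example, all verbs) can be combined with another (for example, utterances containing a pause) without a separate index per annotation layer.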
The complete process involves different technologies such as word sense disambiguation, part-of-speech tagging and lemmatization.
This service is available upon signing an agreement or contract with LLI-UAM.
This service is the result of several R&D projects. Each project focused on developing one level of analysis, and together they yielded a complete toolkit. The added value is the systematized methodology, which has been successfully proven in the construction of different customized corpora.
The work has been supported mainly by public funding through research projects. The methodology was developed during the creation of the C-ORAL-ROM corpus, an EU-funded project of the Fifth Framework Programme (FP5).
- Cresti, E. and Moneglia, M. (eds.). 2005. C-ORAL-ROM: Integrated Reference Corpora for Spoken Romance Languages. Amsterdam: John Benjamins.
- Garrote, M. 2008. CHIEDE: Corpus de habla infantil espontánea del español. PhD dissertation. Universidad Autónoma de Madrid.