Systemized Process of Corpora Development

Authors:	Marta Garrote and Antonio Moreno-Sandoval (LLI-UAM)
URL:	http://www.lllf.uam.es/
Contact:	Antonio Moreno-Sandoval <antonio.msandovaluam.es>

Description

A systemized process for collecting both spontaneous-speech and written corpora. It consists of the following stages, each of which is manually revised by more than one person:

Preliminary design, considering the participants, their sociolinguistic features (age, gender, demographics, linguistic origin, education, etc.) and the communicative context. The design may be adjusted to the goals of the study and the variables it considers.
Data collection (recording, video capture, editing, etc.).
Orthographic transcription (both normative and real speech).
Prosodic annotation, marking pauses, vocal lengthening, overlaps, interruptions, intonation, etc.
Alignment of text-sound units in utterances.
Semi-automatic morpho-syntactic annotation (part-of-speech and lemmas).
Automatic phonological annotation.
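
The stages above, together with the rule that each stage is revised by more than one person, can be sketched as a small tracking structure. This is only an illustration: the stage identifiers, the StageRecord class and the reviewer names are hypothetical and are not part of the LLI-UAM toolkit.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Stage identifiers paraphrase the list above; they are illustrative names only.
STAGES = [
    "preliminary_design",
    "data_collection",
    "orthographic_transcription",
    "prosodic_annotation",
    "text_sound_alignment",
    "morphosyntactic_annotation",
    "phonological_annotation",
]


@dataclass
class StageRecord:
    """One stage of the process plus the people who have revised it."""
    name: str
    reviewers: Set[str] = field(default_factory=set)

    def is_validated(self) -> bool:
        # Assumption: "revised by more than one person" means at least two reviewers.
        return len(self.reviewers) > 1


def next_pending_stage(records: List[StageRecord]) -> Optional[str]:
    """Return the first stage, in order, that still lacks a second reviewer."""
    for record in records:
        if not record.is_validated():
            return record.name
    return None


records = [StageRecord(name) for name in STAGES]
records[0].reviewers.update({"reviewer_a", "reviewer_b"})  # fully revised
records[1].reviewers.add("reviewer_a")                     # only one reviewer so far
print(next_pending_stage(records))
```

Keeping the stages in an ordered list makes it easy to see which step blocks the rest of the process at any given moment.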

Functionality

Beyond the possible applications of the data collections themselves, this methodology allows automatic information processing and retrieval at each linguistic level, since all annotations are standardized using XML.
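
As a rough illustration of retrieval over XML-standardized annotations, the sketch below queries a small sample with Python's standard xml.etree.ElementTree module. The element and attribute names (utterance, w, pause, pos, lemma, speaker) and the sample sentences are assumptions made for this example; they do not reproduce the actual annotation scheme used by LLI-UAM.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation sample: the tag and attribute names are assumed,
# not taken from the real corpus format.
SAMPLE = """
<corpus>
  <utterance id="u1" speaker="A">
    <w pos="DET" lemma="el">la</w>
    <w pos="NOUN" lemma="casa">casa</w>
    <pause/>
    <w pos="VERB" lemma="ser">es</w>
    <w pos="ADJ" lemma="grande">grande</w>
  </utterance>
  <utterance id="u2" speaker="B">
    <w pos="VERB" lemma="ver">veo</w>
    <w pos="DET" lemma="el">la</w>
    <w pos="NOUN" lemma="casa">casa</w>
  </utterance>
</corpus>
"""


def words_by_pos(root, pos):
    """Morpho-syntactic level: (utterance id, word form, lemma) for one POS tag."""
    return [
        (utt.get("id"), word.text, word.get("lemma"))
        for utt in root.iter("utterance")
        for word in utt.iter("w")
        if word.get("pos") == pos
    ]


def utterances_with_pause(root):
    """Prosodic level: ids of utterances that contain a marked pause."""
    return [
        utt.get("id")
        for utt in root.iter("utterance")
        if utt.find("pause") is not None
    ]


root = ET.fromstring(SAMPLE.strip())
print(words_by_pos(root, "NOUN"))
print(utterances_with_pause(root))
```

Because every level is encoded in the same XML tree, a query on one level (part of speech) and a query on another (pauses) use the same traversal.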

Technology

The complete process involves different technologies such as word sense disambiguation, part-of-speech tagging and lemmatization.

Technical requirements

The service is available after signing an agreement or contract with LLI-UAM.

Modules

Innovation

This service is the result of several R&D projects. Each project focused on developing one level of analysis, and together they yield a complete toolkit. The added value is the systemized methodology, which has been successfully tested in building several customized corpora.

Development

The work has been supported mainly by public funding through research projects. The methodology was acquired while building the C-ORAL-ROM corpus, an EU-funded project of the Fifth Framework Programme (FP5).

Publications

Cresti, E. and Moneglia, M. (eds.). 2005. C-ORAL-ROM: Integrated Reference Corpora for Spoken Romance Languages. Amsterdam: John Benjamins.
Garrote, M. 2008. CHIEDE: Corpus de habla infantil espontánea del español. PhD dissertation. Universidad Autónoma de Madrid.

Red Temática en Tratamiento de la Información Multilingüe y Multimodal (TIMM)
