TC-STAR OpenLab on Speech Translation
Trento, Italy 30th March - 1st April 2006

Detailed description

Background Information on the European Parliament and its Plenary Sessions

Please visit the official EUROPARL web site for further information. The Wikipedia article on the European Parliament is worthwhile as well.

A few facts:

  • 732 Members of the European Parliament (MEPs) from 25 different countries; 84 MEPs from Ireland and the United Kingdom, 50 MEPs from Spain.
  • Plenary sessions are held once a month for 4-5 days in Strasbourg or Brussels.
  • The debates are first published as the Rainbow Text Edition, with each speech in its original language only, and later as the Final Text Edition in the individual languages.

The EUROPARL web site provides the Rainbow and Final Text Editions from 1996 up to the present. Europe by Satellite broadcasts most of the European Parliament Plenary Sessions (EPPS); RWTH recorded 48 hours of speech (politicians and interpreters) per language (English and Spanish, original speeches) from May to October 2004.

Brief overview of the task and conditions

In OpenLab2006, only the Spanish-to-English translation direction is considered for the EPPS task. Three kinds of input text are distinguished:

ASR. The first kind is the output of the automatic speech recognition systems. LIMSI and RWTH provided the output of their Spanish recognizers, both 1-best hypotheses and multiple hypotheses in different formats (lattices and confusion networks). The text is in mixed case (LIMSI) or lower case (RWTH), and no punctuation marks are provided. The data was manually segmented at syntactic or semantic breaks.

SAT. The second type of data is the verbatim transcription. It includes spontaneous speech phenomena such as hesitations, corrections, false starts, etc. As with the ASR output, the text is provided without punctuation, but here capitalization is preserved.

FTE. The last kind of data is the Final Text Edition (FTE) provided by the European Parliament. These texts differ slightly from the verbatim transcriptions: some sentences are rewritten. The text includes punctuation and case information, and does not contain transcriptions of spontaneous speech phenomena.

An example of the three kinds of inputs is shown below:

FTE: I am starting to know what Frank Sinatra must have felt like.
SAT: I'm I'm I'm starting to know what Frank Sinatra must have felt like
ASR: and i'm times and starting to know what frank sinatra must have felt like

Concerning evaluation conditions, for the sake of simplicity we suggest adhering to the following requirements:

FTE:     case insensitive, with punctuation
SAT/ASR: case insensitive, no punctuation

In both cases, take into account that the reference texts must be lowercased before evaluation, since they are provided with case information in order to keep them as general as possible.

Of course, feel free to keep or restore the case information during or after translation.
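
For concreteness, the following is a minimal Python sketch of the suggested normalization: all texts are lowercased, and punctuation is additionally stripped for the SAT and ASR conditions. The function name and the exact punctuation set are our own illustrative choices (apostrophes in contractions are deliberately kept), not part of the official evaluation tools.

    PUNCTUATION = '.,;:!?"()[]'  # assumed set; apostrophes in contractions are kept

    def normalize(line, condition):
        """Lowercase a hypothesis or reference line; for SAT/ASR also strip punctuation."""
        text = line.strip().lower()
        if condition in ("SAT", "ASR"):
            text = "".join(ch for ch in text if ch not in PUNCTUATION)
            text = " ".join(text.split())  # collapse any double spaces left behind
        return text

    # Example with the sentence shown above:
    print(normalize("I am starting to know what Frank Sinatra must have felt like.", "FTE"))
    print(normalize("I'm I'm I'm starting to know what Frank Sinatra must have felt like", "SAT"))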

Provided Data

OpenLab2006 will provide participants with three packages (one for each shared task), each including the appropriate subset of the following resources:

  • Training: the latest version of the EPPS text corpora derived from the Final Text Editions, covering April 1996 to May 2005.
  • Dev/Test sets: these are the official corpora used in the first TC-STAR evaluation for the Spanish-to-English translation task. Two reference translations are available for each sentence.
  • Spoken Language Translation Outputs: the test-set outputs (single best translation) of each SLT system participating in the first TC-STAR evaluation are made available.
  • Speech Recognizer Outputs: the speech recognizer outputs of the LIMSI and RWTH decoders are provided. 1-best hypotheses, lattices and confusion networks of the dev/test sets are available.
  • Morpho-Syntax Annotation: EPPS Spanish-English training/dev/test corpora are available in lemmatized and POS-tagged formats.
  • Additional resources: direct and inverse word alignments of the EPPS parallel corpus and a trigram language model of the English texts. For these resources, please contact Mauro Cettolo directly (cettolo@itc.it).

Note that the data can be case sensitive and include punctuation, which may be useful for tagging and parsing tools. However, evaluations are suggested to be case insensitive (and also without punctuation for SAT and ASR texts): this means that each participant has to preprocess the data in the manner most suitable for their own purposes, and at least lowercase the development and test data.

Software (Linux)

  • Evaluation tools: for computing BLEU, WER, PER, lattice WER, and other scores (a small WER illustration is sketched after this list).
  • For processing EPPS data in UTF-8, RWTH provides a set of corpus tools (tokenizer/categorizer, simple proper-noun finder, ...). The package includes binaries, scripts and source code.
  • For encoding conversion, please use a recent version of iconv (1.9 or later); e.g.
    iconv -c -f ISO-8859-1 -t UTF-8
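
As a rough reference for the WER figure reported by the evaluation tools, below is a minimal Python sketch of word error rate computed as the Levenshtein (edit) distance between token sequences, divided by the reference length. It is an illustration only, not the official scorer, and it assumes whitespace tokenization of already normalized text.

    def wer(reference, hypothesis):
        """Word error rate: token-level edit distance divided by reference length."""
        ref = reference.split()
        hyp = hypothesis.split()
        # Dynamic-programming table for the Levenshtein distance.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(ref)][len(hyp)] / float(len(ref))

    # Example: ASR output scored against the verbatim (SAT) transcription shown earlier.
    print(wer("i'm i'm i'm starting to know what frank sinatra must have felt like",
              "and i'm times and starting to know what frank sinatra must have felt like"))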

For any problems with the shared tasks, suggestions and/or comments, please feel free to contact Mauro Cettolo (cettolo@itc.it).