Blizzard Challenge 2012 Rules

  • A registration fee of 500GBP (approx 800USD) is payable by all participants in task EH2.1 to offset the costs of running the challenge, including paying local assistants and listeners. The fee must be paid by the end of May 2012. You can pay this fee using Edinburgh University's online payments system: and register for the event called 'Blizzard Challenge 2012'. After doing this, please also email to notify us that you have paid. If you are really unable to use the online payments system, please contact for assistance with other methods of payment. However, we strongly prefer the epay system because it reduces the costs and admin work for us. If you must pay by bank transfer, please contact us in plenty of time (several weeks before the payment deadline); an additional charge of 50GBP will be made for any payments not made using the epay system.


  • Each participant should try to recruit at least ten volunteer listeners for the evaluation tests. Native speakers are preferable, where possible. The organisers would also appreciate assistance in advertising the Challenge as widely as possible (e.g., to your students or colleagues).


All participants will have access to the following materials (subject to signing appropriate licenses):

Voice building data

Audiobook data is supplied, segmented into utterances and with transcriptions. It comprises around 50 hours of speech material, of which around 32 hours have high-confidence transcriptions; the remainder has transcriptions of lower confidence.

Development data

You will be provided with a development set of natural speech, plus synthetic speech from around 3-5 preliminary systems, along with transcriptions. The data will comprise the following (subject to confirmation):

  • a set of isolated sentences from the Mark Twain novel Alonso Fitz
  • some paragraphs from the Mark Twain novel Alonso Fitz
  • a set of newspaper sentences from the Blizzard 2011 test set
  • a set of SUS from the Blizzard 2011 test set

These data are intended for participants in task EH2.2, to assist in designing an evaluation. They may be used by all other participants, subject to the rules on use of external data.

For the sentences taken from the Blizzard 2011 test set, natural and synthetic speech from the 2011 challenge, together with the corresponding listening test results, can be downloaded from . Note that the 2011 challenge used a different speech database to the 2012 challenge.


This year there are two strands to the Blizzard Challenge: the speech synthesis strand (tasks EH1.1 and EH2.1) and the evaluation strand (task EH2.2).

  • It is not permissible for a single participant to submit multiple entries to task EH2.1, because the listening test would become unmanageable. This rule may be relaxed in the event of a small number of participants.
  • Participants involved in joint projects or consortia who wish to submit multiple systems (e.g., an individual entry and a joint system) should contact the organisers in advance to agree this. We will try to accommodate all reasonable requests, provided the listening test remains manageable.
  • It is permissible (and encouraged) to participate in both EH2.1 and EH2.2.
  • You may register and participate in tasks EH2.1 and/or EH2.2, even if you did not complete task EH1.1.

Phase One (now completed)

  • Task EH1.1: build a voice from the supplied audiobook data. This voice should be demonstrated at the Blizzard Challenge Workshop 2011, in Turin, Italy. There is no formal evaluation in Phase One.

Phase Two (now active)

There are two subtasks in this phase. The first is a synthesis task, similar to those in many previous challenges, but using audiobook data. The second is different from anything we have done before: it does not involve building a synthesiser, but rather designing and conducting a novel type of evaluation. Of course, the Blizzard organisers will still conduct a conventional evaluation similar to previous years, but the audiobook task probably needs a new and better form of evaluation (e.g., using stimuli longer than sentences).

  • Task EH2.1 - build a voice from the supplied audiobook data. Sentences synthesised using this voice should be submitted to the Blizzard organisers for formal evaluation, by the date specified in the timeline.
  • Task EH2.2 - devise a method for evaluating synthetic speech for audiobook applications, and use it to evaluate task EH2.1. The evaluation can use any text you wish (but you are encouraged to consider using both 'in domain' and 'out of domain' text). It can measure any aspect of the synthetic speech that you think is relevant to its performance as an "audiobook reader". You will have the opportunity to request that the participants in task EH2.1 synthesise text provided by you. Participants in this task will be responsible for executing their own listening test; the Blizzard organisers will run an independent test of their own in parallel.


  • "External data" is defined as data, of any type, that is not part of the provided database.
  • You are allowed to use external data in any way you wish, subject to any exclusions given in these rules.
  • Use of external data is entirely optional.
  • You may use the provided audio files, or you may obtain and use the original recordings by John Greenman directly from of the following four books by Mark Twain:
    • A Tramp Abroad
    • Life on the Mississippi
    • The Adventures of Tom Sawyer
    • The Man That Corrupted Hadleyburg, and Other Stories
  • You must not use any additional data from the same speaker (John Greenman), or recordings of any other material by the same author (Mark Twain), or any text by the same author (Mark Twain).
  • You may exclude any parts of the provided databases if you wish.
  • Use of the provided segmentations, transcriptions or labels is optional.
  • The provided development set is intended for use in listening tests only. You must not use the development speech data for voice building (e.g., by including it in a unit selection inventory, or using it to train acoustic models). However, you may use the results of listening tests based on these data to guide your system design.
  • If you are in any doubt about how to apply these rules, please contact the organisers immediately.


  • Phase One: a set of test sentences will be distributed before the 2011 workshop, but no formal listening test is planned. The test sentences will be drawn from contiguous (e.g., paragraph-sized) sections of novels and will have similar segmentation, transcriptions and labels to the distributed corpus.
  • Phase Two: the exact nature of the test set will be determined partly by the entries received for Task EH2.2 but is likely to include both sentence- and paragraph-sized texts from a similar domain to the provided corpus, as well as texts from other domains. Formal listening tests will be conducted to evaluate the synthetic speech submitted during Phase Two.


  • Any examples that you submit for evaluation will be retained by the Blizzard organisers for future use.
  • With your submission of the test sentences, you must include a statement of whether you give the organisers permission to publicly distribute your waveforms and the corresponding listening test results in anonymised form. In the past, all participants have agreed to this and we strongly encourage you to give this consent.


  • The Blizzard organisers will design and conduct a listening test, which will probably include the standard elements used in previous years (naturalness, speaker similarity, intelligibility) and may be extended to include additional tests specific to the audiobook reading task. Participants in task EH2.2 will have access to anonymised versions of the submissions for task EH2.1 and will perform their own evaluations.


  • Each participant will be expected to submit a six-page paper describing their entry for review.
  • One of the authors of each accepted paper should present it at the Blizzard 2012 Workshop.
  • In addition, each participant will be expected to complete a form giving the general technical specification of their system, to facilitate easy cross-system comparisons (e.g., is it unit selection? does it predict prosody?).


  • This is a challenge, which is designed to answer scientific questions, and not a competition. Therefore, we rely on your honesty in preparing your entry.

