Welcome to the Simple4All Tundra Corpus

This is an ongoing project that aims to collect a large number of speech resources in multiple languages and make them freely available to the speech processing community. The first version of the Tundra corpus is a collection of 14 audiobooks in 14 languages: Bulgarian, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Polish, Portuguese, Romanian, Russian and Spanish. The sources for the speech and text data of each audiobook are listed below.

To download the segmented and aligned data, please go to the Download section. You can also download a one-hour subset for each language by following this LINK, or listen to synthetic speech samples built from the corpus in the Demo section.

Language | Code | Title | Author | Speech URL | Text URL | Speaker gender | Total duration [hours] | Aligned duration [hours]
Bulgarian | BG | Zhetvariat | Yordan Yovkov | LINK | LINK | M | 6.1 | 4.1
Danish | DA | Grimms eventyr I udvalg | Grimm Brothers | LINK | LINK | M | 2.1 | 0.7
Dutch | NL | Anna Karenina | Leo Tolstoy | LINK | LINK | M | 6.5 | 4.5
English | EN | Living Alone | Stella Benson | LINK | LINK | F | 4.5 | 2.3
Finnish | FN | Rautatie | Juhani Aho | LINK | LINK | F | 3.1 | 2.5
French | FR | Candide | Voltaire | LINK | LINK | M | 4 | 2.1
German | DE | Das Bildnis des Dorian Gray | Oscar Wilde | LINK | LINK | M | 9.5 | 7.9
Hungarian | HU | Egri csillagok | Geza Gardonyi | LINK | LINK | F | 8.5 | 5
Italian | IT | Galatea | Anton Giulio Barrili | LINK | LINK | M | 6.5 | 5.3
Polish | PL | Siedem wybranych opowiadan | Wladyslaw Orkan | LINK | LINK | F | 3.1 | 2.6
Portuguese | PR | Senhora | Jose de Alencar | LINK | LINK | F | 9.2 | 5.2
Romanian* | RO | Mara | Ioan Slavici | LINK | LINK | F | 11.1 | 6.5
Russian | RU | Ucheniye Khrista | Leo Tolstoy | LINK | LINK | M | 2.1 | 1.3
Spanish** | ES | Don Quijote de la Mancha | Miguel de Cervantes | LINK | LINK | M | 12.1 | 8.0
*The audiobook has been released under a Non-commercial License. Please check the source for more information.
**Only the first 35 chapters of the first part were used for alignment.
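
As a quick way to see how much of each recording survived alignment, the two duration columns can be compared directly. The sketch below is only an illustration: the figures are transcribed by hand from the table above (with 4 and 5 written as 4.0 and 5.0), and the names DURATIONS and summarize are hypothetical helpers, not part of the corpus distribution.

```python
# (language code, total hours, aligned hours), copied by hand from the table.
# These values are an assumption of this sketch; check them against the table.
DURATIONS = [
    ("BG", 6.1, 4.1),
    ("DA", 2.1, 0.7),
    ("NL", 6.5, 4.5),
    ("EN", 4.5, 2.3),
    ("FN", 3.1, 2.5),
    ("FR", 4.0, 2.1),
    ("DE", 9.5, 7.9),
    ("HU", 8.5, 5.0),
    ("IT", 6.5, 5.3),
    ("PL", 3.1, 2.6),
    ("PR", 9.2, 5.2),
    ("RO", 11.1, 6.5),
    ("RU", 2.1, 1.3),
    ("ES", 12.1, 8.0),
]


def summarize(rows):
    """Return (total hours, aligned hours, aligned / total) over all rows."""
    total = sum(t for _, t, _ in rows)
    aligned = sum(a for _, _, a in rows)
    return total, aligned, aligned / total


if __name__ == "__main__":
    # Per-language share of the audio that was successfully aligned.
    for code, total, aligned in DURATIONS:
        print(f"{code}: {aligned / total:.0%} of {total} h aligned")

    total, aligned, share = summarize(DURATIONS)
    print(f"All languages: {aligned:.1f} h of {total:.1f} h ({share:.0%})")
```

Running the script prints one line per language followed by the overall share, which makes it easy to spot languages where a large part of the audiobook could not be aligned.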

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 Unported License. The underlying audio and text are subject to their source licenses, so please check the links before using the data.