Fabrizio

Reputation: 1074

Datasets like "The LJ Speech Dataset"

I am trying to find datasets like the LJ Speech Dataset made by Keith Ito. I need to use them with Tacotron 2 (Link), so I think the datasets need to be structured in a certain way. The LJ dataset is linked directly from the Tacotron 2 GitHub page, so I think it's safe to assume it's made to work with it, and that other datasets should have the same structure as LJ Speech. I downloaded the dataset and found that it's structured like this:

main folder:

    - wavs
        - 001.wav
        - 002.wav
        - etc.
    - metadata.csv: a CSV file that contains the text spoken in each .wav file, in a form like this: **001.wav | hello etc.** (see the example lines just below)
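For reference, the real metadata.csv in LJ Speech is pipe-delimited, omits the .wav extension, and has a third column with a normalized transcription; hypothetical lines in that format would look like this:

    001|Hello world.|Hello world.
    002|Dr. Smith arrived.|Doctor Smith arrived.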

So, my question is: are there other datasets like this one for further training?

But I think there might be problems. For example, the voice in one dataset would be different from the voice in another; would this cause too many problems? And could different slang or regional pronunciations also cause problems?

Upvotes: 2

Views: 3556

Answers (2)

GuyPaddock

Reputation: 2517

Mozilla maintains a collection of several datasets that you can download and use if you don't need your own custom language or voice: https://voice.mozilla.org/data

Alternatively, you could create your own dataset following the structure you outlined in your question. The metadata.csv file needs to contain at least two columns -- the first is the path/name of the WAV file (without the .wav extension), and the second is the text that was spoken.
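As a minimal sketch of generating such a file (the clip IDs and transcripts below are placeholders, and the pipe delimiter follows the LJ Speech convention):

    # generate_metadata.py -- write an LJ Speech-style metadata.csv
    clips = [
        ("001", "Hello, this is the first clip."),
        ("002", "And this is the second clip."),
    ]

    with open("metadata.csv", "w", encoding="utf-8") as f:
        for clip_id, text in clips:
            # One line per clip: file ID (no .wav extension), then the text.
            f.write(f"{clip_id}|{text}\n")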

Unless you are training Tacotron with speaker embedding/a multi-speaker model, you'd want all the recordings to be from the same speaker. Ideally, the audio quality should be very consistent, with a minimal amount of background noise. Some background noise can be removed using RNNoise. There's a script in the Mozilla Discourse group that you can use as a reference. All the recordings need to be short, 22050 Hz, 16-bit audio clips.
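If your source audio isn't already in that format, here's a rough sketch of converting it with librosa and soundfile (the file paths are placeholders, and this assumes both libraries are installed):

    import librosa
    import soundfile as sf

    src = "raw/001.wav"   # placeholder input path
    dst = "wavs/001.wav"  # placeholder output path

    # Load and resample to 22050 Hz, downmixing to mono.
    audio, sr = librosa.load(src, sr=22050, mono=True)

    # Write as 16-bit PCM WAV.
    sf.write(dst, audio, sr, subtype="PCM_16")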

As for slang or local colloquialisms -- I'm not sure, but I suspect that as long as the spoken sounds match what's written (i.e. the phonemes match up), the system should be able to handle it. Tacotron is able to handle/train on multiple languages.

If you don't have the resources to produce your own recordings, you could use audio from a permissively licensed audiobook in the target language. There's a tutorial on this very topic here: https://medium.com/@klintcho/creating-an-open-speech-recognition-dataset-for-almost-any-language-c532fb2bc0cf

The tutorial has you:

  1. Download the audio from the audiobook.
  2. Remove any parts that aren't useful (e.g. the introduction, foreword, etc.) with Audacity.
  3. Use Aeneas to fine-tune and then export a forced alignment between the audio and the text of the e-book, so that the audio can be exported sentence by sentence (see the sketch after this list).
  4. Create the metadata.csv file containing the map from audio to segments. (The format that the post describes seems to include extra columns that aren't really needed for training and are mainly for use by Mozilla's online database.)
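As a rough sketch of step 3 (the file paths and language code here are assumptions, not taken from the tutorial), a forced alignment with Aeneas could look like this:

    from aeneas.executetask import ExecuteTask
    from aeneas.task import Task

    # Assumes a plain-text transcript with one sentence per line.
    config = u"task_language=eng|is_text_type=plain|os_task_file_format=json"
    task = Task(config_string=config)
    task.audio_file_path_absolute = u"/path/to/audiobook.wav"
    task.text_file_path_absolute = u"/path/to/transcript.txt"
    task.sync_map_file_path_absolute = u"/path/to/syncmap.json"

    # Compute the alignment and write the sync map, which records the
    # start/end time of each sentence so the audio can be cut into clips.
    ExecuteTask(task).execute()
    task.output_sync_map_file()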

You can then use this dataset with systems that support LJSpeech, like Mozilla TTS.

Upvotes: 0

dabhand

Reputation: 517

There are a few resources:

The main ones I would look at are Festvox (aka CMU Arctic) http://www.festvox.org/dbs/index.html and LibriVox https://librivox.org/

These folks seem to be maintaining a list: https://github.com/candlewill/Speech-Corpus-Collection

And I am part of a project that is collecting more (shameless self plug): https://github.com/Idlak/Living-Audio-Dataset

Upvotes: 1
