Reputation: 53806
Running the model out of the box generates these files in the data dir :
ls
dev-v2.tgz newstest2013.en
giga-fren.release2.fixed.en newstest2013.en.ids40000
giga-fren.release2.fixed.en.gz newstest2013.fr
giga-fren.release2.fixed.en.ids40000 newstest2013.fr.ids40000
giga-fren.release2.fixed.fr training-giga-fren.tar
giga-fren.release2.fixed.fr.gz vocab40000.from
giga-fren.release2.fixed.fr.ids40000 vocab40000.to
Reading the src of translate.py :
https://github.com/tensorflow/models/blob/master/tutorials/rnn/translate/translate.py
tf.app.flags.DEFINE_string("from_train_data", None, "Training data.")
tf.app.flags.DEFINE_string("to_train_data", None, "Training data.")
To utilize my own training data I created dirs my-from-train-data & to-from-train-data and add my own training data to each of these dirs, training data is contained in the files mydata.from & mydata.to
my-to-train-data contains mydata.from
my-from-train-data contains mydata.to
I could not find documentation as to using own training data or what format it should take so I inferred this from the translate.py src and contents of data dir created when executing translate model out of the box.
Contents of mydata.from :
Is this a question
Contents of mydata.to :
Yes!
I then attempt to train the model using :
python translate.py --from_train_data my-from-train-data --to_train_data my-to-train-data
This returns with an error :
tensorflow.python.framework.errors_impl.NotFoundError: my-from-train-data.ids40000
Appears I need to create file my-from-train-data.ids40000 , what should it's contents be ? Is there an example of how to train this model using custom data ?
Upvotes: 14
Views: 1077
Reputation: 4451
blue-sky
Great question, training a model on your own data is way more fun than using the standard data. An example of what you could put in the terminal is:
python translate.py --from_train_data mydatadir/to_translate.in --to_train_data mydatadir/to_translate.out --from_dev_data mydatadir/test_to_translate.in --to_dev_data mydatadir/test_to_translate.out --train_dir train_dir_model --data_dir mydatadir
What goes wrong in your example is that you are not pointing to a file, but to a folder. from_train_data should always point to a plaintext file, whose rows should be aligned with those in the to_train_data file.
Also: as soon as you run this script with sensible data (more than one line ;) ), translate.py will generate your ids (40.000 if from_vocab_size and to_vocab_size are not set). Important to know is that this file is created in the folder specified by data_dir... if you do not specify one this means they are generated in /tmp (I prefer them at the same place as my data).
Hope this helps!
Upvotes: 3
Reputation: 3471
Quick answer to :
Appears I need to create file my-from-train-data.ids40000 , what should it's contents be ? Is there an example of how to train this model using custom data ?
Yes, that's the vocab/ word-id file missing, which is generated when preparing to create the data.
Here is a tutorial from the Tesnorflow documentation.
quick over-view of the files and why you might be confused by the files outputted vs what to use:
python/ops/seq2seq.py
: >> Library for building sequence-to-sequence models.models/rnn/translate/seq2seq_model.py
: >> Neural translation sequence-to-sequence model.models/rnn/translate/data_utils.py
: >> Helper functions for preparing translation data.models/rnn/translate/translate.py
: >> Binary that trains and runs the translation model.The Tensorflow translate.py
file requires several files to be generated when using your own corpus to translate.
It needs to be aligned, meaning: language line 1 in file 1.
<> language line 1 file 2.
This
allows the model to do encoding and decoding.
You want to make sure the Vocabulary have been generated from the dataset using this file: Check these steps:
python translate.py
--data_dir [your_data_directory] --train_dir [checkpoints_directory]
--en_vocab_size=40000 --fr_vocab_size=40000
Note! If the Vocab-size is lower, then change that value.
There is a longer discussion here tensorflow/issues/600
If all else fails, check out this ByteNet implementation in Tensorflow which does translation task as well.
Upvotes: 2