Reputation: 463
I am probably missing something obvious, but after following the steps outlined in the "running locally" README, I can't successfully submit a training job on an EC2 V100 instance.
So far, I completed the following steps:
Converted the train and test sets to TFRecord format.
Created a new label map (.pbtxt) with the 6 classes for my dataset.
Updated the pipeline config file to reflect the paths and the number of classes.
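For reference, the label map step can be sketched roughly as below. This is only an illustration: the class names are hypothetical placeholders (the post does not list the real ones), and build_label_map is a helper invented for this example, not part of the Object Detection API. It emits the usual item { id ... name ... } entries, with IDs starting at 1 because 0 is reserved for the background class.

```python
def build_label_map(class_names):
    """Return label map text in the pbtxt layout read by the Object Detection API."""
    entries = []
    # IDs start at 1; id 0 is reserved for the background class.
    for class_id, name in enumerate(class_names, start=1):
        entries.append("item {\n  id: %d\n  name: '%s'\n}\n" % (class_id, name))
    return "\n".join(entries)


# Hypothetical class names; replace them with the six real classes of the dataset.
CLASS_NAMES = ["class_a", "class_b", "class_c", "class_d", "class_e", "class_f"]

label_map_text = build_label_map(CLASS_NAMES)
print(label_map_text)
```

The printed text could then be saved as tp_label_map.pbtxt, and num_classes in pipeline.config should match the number of entries.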
My final directory structure is as follows (+ denotes folder and - denotes file):
+ models
+ faster_rcnn_resnet101_coco_2018_01_28
- model.ckpt.data-00000-of-00001
- model.ckpt.meta
- model.ckpt.index
+ model
+ train
+ eval
- pipeline.config
+ data
- train.record
- test.record
- tp_label_map.pbtxt
One concern is that I do not know what the train and eval folders inside the models correspond to in the README.
PIPELINE_CONFIG_PATH=/home/ubuntu/models/research/object_detection/models/faster_rcnn_resnet101_coco_2018_01_28/pipeline.config
MODEL_DIR=/home/ubuntu/models/research/object_detection/models/model
NUM_TRAIN_STEPS=50000
SAMPLE_1_OF_N_EVAL_EXAMPLES=1
python object_detection/model_main.py \
--pipeline_config_path=${PIPELINE_CONFIG_PATH} \
--model_dir=${MODEL_DIR} \
--num_train_steps=${NUM_TRAIN_STEPS} \
--sample_1_of_n_eval_examples=$SAMPLE_1_OF_N_EVAL_EXAMPLES \
--alsologtostderr
I get the following warnings, and then it just hangs there for 10 minutes or so without moving on to the training stage.
But I do get the files populated in the model directory (train and eval are empty).
+models
- events.out.tfevents.1557175306.ip-172-31-32-179
- graph.pbtxt
- model.ckpt-0.data-00000-of-00001
- model.ckpt-0.index
- model.ckpt-0.meta
I looked up the comment here, but when I check nvidia-smi or tensorboard, I don't see anything generated.
Tensorboard output
WARNING: The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
If you depend on functionality not listed there, please file an issue.
*********** In model lib ************* /home/ubuntu/models/research/object_detection/models/faster_rcnn_resnet101_coco_2018_01_28/pipeline.config
WARNING:tensorflow:Forced number of epochs for all eval validations to be 1.
WARNING:tensorflow:Expected number of evaluation epochs is 1, but instead encountered `eval_on_train_input_config.num_epochs` = 0. Overwriting `num_epochs` to 1.
WARNING:tensorflow:Estimator's model_fn (<function create_model_fn.<locals>.model_fn at 0x7f6c5b26d048>) includes params argument, but params are not passed to Estimator.
WARNING:tensorflow:From /home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
WARNING:tensorflow:From /home/ubuntu/models/research/object_detection/builders/dataset_builder.py:80: parallel_interleave (from tensorflow.contrib.data.python.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.parallel_interleave(...)`.
WARNING:tensorflow:From /home/ubuntu/models/research/object_detection/utils/ops.py:472: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /home/ubuntu/models/research/object_detection/inputs.py:320: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /home/ubuntu/models/research/object_detection/builders/dataset_builder.py:152: batch_and_drop_remainder (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.batch(..., drop_remainder=True)`.
WARNING:tensorflow:From /home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py:1624: flatten (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
Instructions for updating:
Use keras.layers.flatten instead.
WARNING:tensorflow:From /home/ubuntu/models/research/object_detection/meta_architectures/faster_rcnn_meta_arch.py:2298: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
WARNING:tensorflow:From /home/ubuntu/models/research/object_detection/core/losses.py:345: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.
See `tf.nn.softmax_cross_entropy_with_logits_v2`.
/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:110: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
WARNING:tensorflow:From /home/ubuntu/models/research/object_detection/eval_util.py:785: to_int64 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /home/ubuntu/models/research/object_detection/utils/visualization_utils.py:429: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, use
tf.py_function, which takes a python function which manipulates tf eager
tensors instead of numpy arrays. It's easy to convert a tf eager tensor to
an ndarray (just call tensor.numpy()) but having access to eager tensors
means `tf.py_function`s can use accelerators such as GPUs as well as
being differentiable using a gradient tape.
WARNING:tensorflow:From /home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/training/saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
Upvotes: 1
Views: 2362
Reputation: 4071
The documentation is indeed a bit unclear about model_dir, but the source code comment explains it clearly.
model_dir is the directory where your new checkpoint files are saved. It is not the same as the directory holding the pretrained checkpoint files you use for fine-tuning, and you should not set model_dir to the pretrained checkpoint path.
It is better to keep model_dir empty of checkpoint files each time you submit a new training job; otherwise, if it already contains checkpoint files, the model may skip training (here).
The train and eval directories are listed there only for illustration. You can lay out your directories that way, but it is not required. You just need to pass model_dir an empty directory where the checkpoint files can be saved.
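A quick way to see what a given model_dir already holds before submitting a job is sketched below. It is a simplification under stated assumptions: it only looks at step-numbered file names such as model.ckpt-0.index, whereas the Estimator itself locates the latest checkpoint through its own bookkeeping, and the helpers checkpoint_steps and describe_model_dir are names made up for this example. The demo runs against a temporary directory, not the real paths from the question.

```python
import os
import re
import tempfile

# Matches step-numbered checkpoint index files such as "model.ckpt-0.index".
_CKPT_INDEX = re.compile(r"^model\.ckpt-(\d+)\.index$")


def checkpoint_steps(model_dir):
    """Return the sorted global steps of the checkpoints found in model_dir."""
    steps = []
    for name in os.listdir(model_dir):
        match = _CKPT_INDEX.match(name)
        if match:
            steps.append(int(match.group(1)))
    return sorted(steps)


def describe_model_dir(model_dir, num_train_steps):
    """Summarise how a training job would likely treat this model_dir."""
    steps = checkpoint_steps(model_dir)
    if not steps:
        return "empty: training starts from scratch"
    latest = steps[-1]
    if latest >= num_train_steps:
        # Assumption: training stops once the global step reaches num_train_steps.
        return "skip: checkpoint at step %d already reaches %d" % (latest, num_train_steps)
    return "resume: training would continue from step %d" % latest


# Demo on a throwaway directory that mimics the listing in the question.
with tempfile.TemporaryDirectory() as demo_dir:
    empty_status = describe_model_dir(demo_dir, 50000)
    for name in ("graph.pbtxt", "model.ckpt-0.index", "model.ckpt-0.meta"):
        open(os.path.join(demo_dir, name), "w").close()
    demo_steps = checkpoint_steps(demo_dir)
    demo_status = describe_model_dir(demo_dir, 50000)

print(empty_status)
print(demo_steps, demo_status)
```

If the function reports anything other than "empty" for a directory meant for a fresh run, pointing model_dir at a new, empty directory avoids picking up the old checkpoints.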
Upvotes: 1