Vinit Sutar

Reputation: 45

Matterport's mask rcnn doesn't train after setting up parameters

Task: Mask RCNN train_shapes.ipynb tutorial. Training to segment different shapes in the artificially generated shapes dataset.

Problem: Matterport's Mask RCNN implementation doesn't work out of the box for this notebook.

Things I have tried:

  1. Fixed all the class and package errors caused by importing the config, model, and utils files.
  2. Fixed the TF 2.x errors caused by deprecated code.

Parameters I have set:

Configurations:
BACKBONE                       resnet101
BACKBONE_STRIDES               [4, 8, 16, 32, 64]
BATCH_SIZE                     1
BBOX_STD_DEV                   [0.1 0.1 0.2 0.2]
COMPUTE_BACKBONE_SHAPE         None
DETECTION_MAX_INSTANCES        100
DETECTION_MIN_CONFIDENCE       0.7
DETECTION_NMS_THRESHOLD        0.3
FPN_CLASSIF_FC_LAYERS_SIZE     1024
GPU_COUNT                      1
GRADIENT_CLIP_NORM             5.0
IMAGES_PER_GPU                 1
IMAGE_CHANNEL_COUNT            3
IMAGE_MAX_DIM                  128
IMAGE_META_SIZE                16
IMAGE_MIN_DIM                  128
IMAGE_MIN_SCALE                0
IMAGE_RESIZE_MODE              square
IMAGE_SHAPE                    [128 128   3]
LEARNING_MOMENTUM              0.9
LEARNING_RATE                  0.001
LOSS_WEIGHTS                   {'rpn_class_loss': 1.0, 'rpn_bbox_loss': 1.0, 'mrcnn_class_loss': 1.0, 'mrcnn_bbox_loss': 1.0, 'mrcnn_mask_loss': 1.0}
MASK_POOL_SIZE                 14
MASK_SHAPE                     [28, 28]
MAX_GT_INSTANCES               100
MEAN_PIXEL                     [123.7 116.8 103.9]
MINI_MASK_SHAPE                (56, 56)
NAME                           shapes
NUM_CLASSES                    4
POOL_SIZE                      7
POST_NMS_ROIS_INFERENCE        1000
POST_NMS_ROIS_TRAINING         2000
PRE_NMS_LIMIT                  6000
ROI_POSITIVE_RATIO             0.33
RPN_ANCHOR_RATIOS              [0.5, 1, 2]
RPN_ANCHOR_SCALES              (8, 16, 32, 64, 128)
RPN_ANCHOR_STRIDE              1
RPN_BBOX_STD_DEV               [0.1 0.1 0.2 0.2]
RPN_NMS_THRESHOLD              0.7
RPN_TRAIN_ANCHORS_PER_IMAGE    256
STEPS_PER_EPOCH                5
TOP_DOWN_PYRAMID_SIZE          256
TRAIN_BN                       False
TRAIN_ROIS_PER_IMAGE           5
USE_MINI_MASK                  False
USE_RPN_ROIS                   True
VALIDATION_STEPS               5
WEIGHT_DECAY                   0.0001

Implementation details:

  1. I am using coco weights to initialize my model.
  2. Model in training mode.
  3. Training heads first.
  4. Epoch = 1
  5. Learning rate = 0.001

Output:


Starting at epoch 0. LR=0.001

Checkpoint Path: /logs/shapes20211123T0437/mask_rcnn_shapes_{epoch:04d}.h5
Selecting layers to train
fpn_c5p5               (Conv2D)
fpn_c4p4               (Conv2D)
fpn_c3p3               (Conv2D)
fpn_c2p2               (Conv2D)
fpn_p5                 (Conv2D)
fpn_p2                 (Conv2D)
fpn_p3                 (Conv2D)
fpn_p4                 (Conv2D)
rpn_model              (Functional)
mrcnn_mask_conv1       (TimeDistributed)
mrcnn_mask_bn1         (TimeDistributed)
mrcnn_mask_conv2       (TimeDistributed)
mrcnn_mask_bn2         (TimeDistributed)
mrcnn_class_conv1      (TimeDistributed)
mrcnn_class_bn1        (TimeDistributed)
mrcnn_mask_conv3       (TimeDistributed)
mrcnn_mask_bn3         (TimeDistributed)
mrcnn_class_conv2      (TimeDistributed)
mrcnn_class_bn2        (TimeDistributed)
mrcnn_mask_conv4       (TimeDistributed)
mrcnn_mask_bn4         (TimeDistributed)
mrcnn_bbox_fc          (TimeDistributed)
mrcnn_mask_deconv      (TimeDistributed)
mrcnn_class_logits     (TimeDistributed)
mrcnn_mask             (TimeDistributed)

/usr/local/lib/python3.7/dist-packages/keras/optimizer_v2/gradient_descent.py:102: UserWarning: The `lr` argument is deprecated, use `learning_rate` instead.
  super(SGD, self).__init__(name, **kwargs)

System hardware specifications:

  1. Intel Xeon 12 CPU
  2. 25GB RAM
  3. 64GB Storage.
  4. Ubuntu 20.04 Desktop. VM running on company's internal server.

Software Specifications:

  1. Anaconda Latest version
  2. TF 2.7.0
  3. Keras 2.4

Questions:

  1. Why doesn't the training start, even after 3 hours?
  2. Is there an error in my configuration?
  3. Is my system sufficient?
  4. Is the implementation correct?
  5. What changes should be made to make this work?

Notebook: Colab notebook

Upvotes: 0

Views: 901

Answers (3)

Wes

Reputation: 1840

I had the same issue. The fix of setting workers to 1 and disabling multiprocessing didn't work for me. I found out that it was trying to use the CPU instead of the GPU. The fix was to make sure CUDA was installed properly or, on an HPC cluster, to run something like module load cuda and make sure you've provisioned a node with a GPU.
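
A quick way to check which device TensorFlow will use is to list the GPUs it can see. This is a minimal sketch, assuming TensorFlow 2.1 or later; the helper name visible_gpus is made up for illustration. An empty list means training would fall back to the CPU.

```python
def visible_gpus():
    """Return the names of the GPUs TensorFlow can see.

    Returns None if TensorFlow cannot be imported in this environment.
    `visible_gpus` is a hypothetical helper name, not part of any library.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # list_physical_devices("GPU") returns an empty list when no GPU is usable
    return [device.name for device in tf.config.list_physical_devices("GPU")]


gpus = visible_gpus()
if gpus is None:
    print("TensorFlow is not installed in this environment")
elif not gpus:
    print("No GPU visible to TensorFlow; training would run on the CPU")
else:
    print("GPUs visible to TensorFlow:", gpus)
```

If the list is empty on a machine that does have a GPU, the CUDA installation or the loaded modules are the first thing to look at.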

Upvotes: 0

D.Manasreh

Reputation: 930

Try this:

1- Inside the (mrcnn) folder open the file (model.py).

2- Change line 2362 from:

workers = multiprocessing.cpu_count()

to:

workers = 1

3- Change line 2374 from:

use_multiprocessing=True,

to:

use_multiprocessing=False,

Or you can try this fork, where I have already made these changes: https://github.com/manasrda/Mask_RCNN. This fixed a similar problem for me.
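
Line numbers differ between forks, so it can be easier to make the two changes by matching the text rather than the line. The following is only a sketch: patch_fit_call is a made-up helper name, and the sample string is an illustrative stand-in for the relevant lines, not a copy of model.py.

```python
import re


def patch_fit_call(source: str) -> str:
    """Return `source` with data loading forced onto one in-process worker.

    `patch_fit_call` is a hypothetical helper, not part of Mask_RCNN.
    """
    # workers = multiprocessing.cpu_count()  ->  workers = 1
    source = re.sub(
        r"workers\s*=\s*multiprocessing\.cpu_count\(\)",
        "workers = 1",
        source,
    )
    # use_multiprocessing=True  ->  use_multiprocessing=False
    source = re.sub(
        r"use_multiprocessing\s*=\s*True",
        "use_multiprocessing=False",
        source,
    )
    return source


# Illustrative stand-in for the lines in question (assumed, not the real file)
sample = (
    "workers = multiprocessing.cpu_count()\n"
    "use_multiprocessing=True,\n"
)
patched = patch_fit_call(sample)
print(patched)
```

To apply it to a real checkout, read the contents of the model.py file in the mrcnn folder, pass them through the helper, and write the result back after saving a backup copy.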

Upvotes: 1

Ruthy

Reputation: 55

The training hangs, and this is a known issue. The fix is simple: find the fit call in the model.py file (it should be somewhere around lines 2360-2370 in the TF2 project), then set the 'workers' argument to 1 and the 'use_multiprocessing' argument to False.

Upvotes: 1
