
python tensorflow keras faster-rcnn matterport

Reputation: 45

Matterport's mask rcnn doesn't train after setting up parameters

Task: Mask RCNN train_shapes.ipynb tutorial. Training to segment different shapes in the artificially generated shapes dataset.

Problem: Matterport's Mask RCNN implementation doesn't work out of the box for this notebook.

Things I have tried:

Resolved all the class and package errors caused by the imported files, namely config, model, and utils.
Resolved the TF 2.x errors caused by deprecated code.

Parameters I have set:

Configurations:
BACKBONE                       resnet101
BACKBONE_STRIDES               [4, 8, 16, 32, 64]
BATCH_SIZE                     1
BBOX_STD_DEV                   [0.1 0.1 0.2 0.2]
COMPUTE_BACKBONE_SHAPE         None
DETECTION_MAX_INSTANCES        100
DETECTION_MIN_CONFIDENCE       0.7
DETECTION_NMS_THRESHOLD        0.3
FPN_CLASSIF_FC_LAYERS_SIZE     1024
GPU_COUNT                      1
GRADIENT_CLIP_NORM             5.0
IMAGES_PER_GPU                 1
IMAGE_CHANNEL_COUNT            3
IMAGE_MAX_DIM                  128
IMAGE_META_SIZE                16
IMAGE_MIN_DIM                  128
IMAGE_MIN_SCALE                0
IMAGE_RESIZE_MODE              square
IMAGE_SHAPE                    [128 128   3]
LEARNING_MOMENTUM              0.9
LEARNING_RATE                  0.001
LOSS_WEIGHTS                   {'rpn_class_loss': 1.0, 'rpn_bbox_loss': 1.0, 'mrcnn_class_loss': 1.0, 'mrcnn_bbox_loss': 1.0, 'mrcnn_mask_loss': 1.0}
MASK_POOL_SIZE                 14
MASK_SHAPE                     [28, 28]
MAX_GT_INSTANCES               100
MEAN_PIXEL                     [123.7 116.8 103.9]
MINI_MASK_SHAPE                (56, 56)
NAME                           shapes
NUM_CLASSES                    4
POOL_SIZE                      7
POST_NMS_ROIS_INFERENCE        1000
POST_NMS_ROIS_TRAINING         2000
PRE_NMS_LIMIT                  6000
ROI_POSITIVE_RATIO             0.33
RPN_ANCHOR_RATIOS              [0.5, 1, 2]
RPN_ANCHOR_SCALES              (8, 16, 32, 64, 128)
RPN_ANCHOR_STRIDE              1
RPN_BBOX_STD_DEV               [0.1 0.1 0.2 0.2]
RPN_NMS_THRESHOLD              0.7
RPN_TRAIN_ANCHORS_PER_IMAGE    256
STEPS_PER_EPOCH                5
TOP_DOWN_PYRAMID_SIZE          256
TRAIN_BN                       False
TRAIN_ROIS_PER_IMAGE           5
USE_MINI_MASK                  False
USE_RPN_ROIS                   True
VALIDATION_STEPS               5
WEIGHT_DECAY                   0.0001

Implementation details:

I am using coco weights to initialize my model.
Model in training mode.
Training heads first.
Epoch = 1
Learning rate = 0.001

Output:


Starting at epoch 0. LR=0.001

Checkpoint Path: /logs/shapes20211123T0437/mask_rcnn_shapes_{epoch:04d}.h5
Selecting layers to train
fpn_c5p5               (Conv2D)
fpn_c4p4               (Conv2D)
fpn_c3p3               (Conv2D)
fpn_c2p2               (Conv2D)
fpn_p5                 (Conv2D)
fpn_p2                 (Conv2D)
fpn_p3                 (Conv2D)
fpn_p4                 (Conv2D)
rpn_model              (Functional)
mrcnn_mask_conv1       (TimeDistributed)
mrcnn_mask_bn1         (TimeDistributed)
mrcnn_mask_conv2       (TimeDistributed)
mrcnn_mask_bn2         (TimeDistributed)
mrcnn_class_conv1      (TimeDistributed)
mrcnn_class_bn1        (TimeDistributed)
mrcnn_mask_conv3       (TimeDistributed)
mrcnn_mask_bn3         (TimeDistributed)
mrcnn_class_conv2      (TimeDistributed)
mrcnn_class_bn2        (TimeDistributed)
mrcnn_mask_conv4       (TimeDistributed)
mrcnn_mask_bn4         (TimeDistributed)
mrcnn_bbox_fc          (TimeDistributed)
mrcnn_mask_deconv      (TimeDistributed)
mrcnn_class_logits     (TimeDistributed)
mrcnn_mask             (TimeDistributed)

/usr/local/lib/python3.7/dist-packages/keras/optimizer_v2/gradient_descent.py:102: UserWarning: The `lr` argument is deprecated, use `learning_rate` instead.
  super(SGD, self).__init__(name, **kwargs)

This is the only output I see. There is no progress bar for the epoch, and nothing changes for 2-3 hours.
I later found out that this individual has also done a code cleanup. I tried his ".py" files as well, and the same thing happens.

System hardware specifications:

Intel Xeon 12 CPU
25GB RAM
64GB Storage.
Ubuntu 20.04 Desktop. VM running on company's internal server.

Software Specifications:

Anaconda Latest version
TF 2.7.0
Keras 2.4

Questions:

Why doesn't the training start even after 3 hours?
Is there an error in my configuration?
Is my system sufficient?
Is the implementation correct?
What changes should be made to make this work?

Notebook: Colab notebook

Upvotes: 0

Answers (3)

Wes

Reputation: 1840

I had the same issue. Setting workers to 1 and disabling multiprocessing didn't fix it for me. I found out that training was running on the CPU instead of the GPU. The fix was to make sure CUDA was installed properly. On an HPC cluster, that means doing something like module load cuda and making sure you've provisioned a node with a GPU.
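A quick way to check whether TensorFlow sees a GPU at all is to list its physical devices before training. The snippet below is only a sketch: summarize_devices and visible_device_names are hypothetical helper names, and the only TensorFlow call it relies on is tf.config.list_physical_devices.

```python
def summarize_devices(device_names):
    """Return a one-line verdict from a list of device name strings."""
    gpus = [name for name in device_names if "GPU" in name.upper()]
    if gpus:
        return "GPU visible: " + ", ".join(gpus)
    return "No GPU visible; training would fall back to the CPU"


def visible_device_names():
    """List the physical devices TensorFlow can see (empty if TF is not installed)."""
    try:
        import tensorflow as tf
    except ImportError:
        return []
    # Each entry is a PhysicalDevice with a name such as '/physical_device:GPU:0'.
    return [device.name for device in tf.config.list_physical_devices()]


# Example (run in the same environment as the notebook):
# print(summarize_devices(visible_device_names()))
```

If this reports no GPU, the hang is more likely an environment problem (CUDA install, drivers, or a node without a GPU) than a Mask RCNN configuration problem.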

Upvotes: 0

D.Manasreh

Reputation: 930

Try this:

1- Inside the (mrcnn) folder open the file (model.py).

2- Change line 2362 from:

workers = multiprocessing.cpu_count()

to:

workers = 1

3- Change line 2374 from:

use_multiprocessing=True,

to:

use_multiprocessing=False,

Or you can try using this fork where I already did these changes. https://github.com/manasrda/Mask_RCNN This fixed a similar problem for me.
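For reference, the two edits above amount to passing single-process data-loading arguments to the Keras training call. Here is a minimal sketch of those arguments, assuming the TF 2.x Keras fit signature; single_process_fit_kwargs is a hypothetical helper, not part of the mrcnn package, and the max_queue_size default is only illustrative.

```python
def single_process_fit_kwargs(max_queue_size=10):
    """Build data-loading keyword arguments that keep Keras in a single process.

    workers=1 and use_multiprocessing=False mirror the two edits to model.py;
    max_queue_size is an illustrative default, not a value from the fork.
    """
    return {
        "workers": 1,
        "use_multiprocessing": False,
        "max_queue_size": max_queue_size,
    }


# Hypothetical usage with a TF 2.x Keras model and a data generator:
# keras_model.fit(train_generator, epochs=1, **single_process_fit_kwargs())
```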

Upvotes: 1

Ruthy

Reputation: 55

The training hangs, and this is a known issue. The fix is simple: find the fit call in the model.py file (it should be somewhere around lines 2360-2370 in the TF2 project), and set the 'workers' argument to 1 and the 'use_multiprocessing' argument to False.
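Since the exact line numbers differ between forks, it can be easier to search model.py for the two settings than to count lines. The following is a rough sketch: find_loader_settings and find_in_file are hypothetical helper names, and the mrcnn/model.py path in the usage comment assumes the usual repository layout.

```python
import re
from pathlib import Path

# Matches assignments or keyword arguments such as "workers = ..." or
# "use_multiprocessing=True".
PATTERN = re.compile(r"\b(workers|use_multiprocessing)\b\s*=")


def find_loader_settings(source_text):
    """Return (line_number, stripped_line) pairs that set either option."""
    hits = []
    for number, line in enumerate(source_text.splitlines(), start=1):
        if PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits


def find_in_file(path):
    """Run the same search over a file on disk."""
    return find_loader_settings(Path(path).read_text(encoding="utf-8"))


# Example, assuming the repository layout described in this thread:
# for number, line in find_in_file("mrcnn/model.py"):
#     print(number, line)
```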

Upvotes: 1
