Martin1234567

Reputation: 11

First TensorFlow object detection model - from image to .pb (finally to MyriadX blob for OAK-D) - rookie questions

This is my first model. I'm new to Python and this is my second post on Stack Overflow, so please let me know if there is anything I should elaborate on, and keep in mind there could be an easy solution to my problems.

EDIT - I have found a way to test the model on images after training (when it's a .pb file) and this gives a great result, so right now I'm focusing the question on the last part: the conversion to IR and blob. EDIT

Problem to solve with the model: the easiest way to explain it is with an example. Imagine a parking lot with 8 parking spots, paired two by two. [simple sketch of the parking lot]

The idea is to know which parking spots are occupied and which are not. As the camera is located "from the side", I can't really mark the parking spots themselves: if, for example, spot 6 is occupied, the ground that is spot 5 is not visible.

My plan is that when a car is coming down the road it is labeled "car", and when it parks it is re-labeled "car_6" (if it parks on parking spot 6). Do I need to use some sort of hierarchy to make the change from "car" to "car_6" without problems? (The parking lot is just an example to explain the problem, so there will be no issues with permission to record videos.) The objects that will actually be detected by the model are about the same size as a car, but round.

After some advice I have heard that ssd_mobilenet_v2 is a good model for the problem. As speed is not really that important, I think any of the models would be able to get the job done, but I have now worked a bit with MobileNet and learned a bit, so I would like to stay with this model.

I'm using TensorFlow 2 on Windows 10.

What I have done so far:

Labeled (with labelImg) around 250 pictures (.xml), converted them to CSV and then to .record. The pictures are 320×320 JPGs.
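
For reference, the XML-to-CSV step amounts to something like this sketch, in the spirit of the common xml_to_csv.py tutorial script (the directory path and output filename here are placeholders, not my actual ones):

```
import glob
import xml.etree.ElementTree as ET

import pandas as pd

def xml_to_csv(xml_dir):
    # Collect one row per bounding box from every labelImg (Pascal VOC) XML file
    rows = []
    for xml_file in glob.glob(f"{xml_dir}/*.xml"):
        root = ET.parse(xml_file).getroot()
        filename = root.find("filename").text
        width = int(root.find("size/width").text)
        height = int(root.find("size/height").text)
        for obj in root.findall("object"):
            box = obj.find("bndbox")
            rows.append({
                "filename": filename,
                "width": width,
                "height": height,
                "class": obj.find("name").text,
                "xmin": int(box.find("xmin").text),
                "ymin": int(box.find("ymin").text),
                "xmax": int(box.find("xmax").text),
                "ymax": int(box.find("ymax").text),
            })
    return pd.DataFrame(rows)

xml_to_csv("D:/(filepath)/annotations").to_csv("train_labels.csv", index=False)
```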

The label map was created as a .txt file and then turned into a .pbtxt simply by renaming the extension in Windows.
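
In case the format matters: the label map the TF2 Object Detection API expects looks roughly like this (the class names here are hypothetical; the IDs must start at 1, match the IDs written into the .record files, and continue with one item per class up to num_classes):

```
item {
  id: 1
  name: 'car'
}
item {
  id: 2
  name: 'car_1'
}
item {
  id: 3
  name: 'car_2'
}
```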

(filepath) is used to make the paths a bit shorter; I have a very excessive directory structure for this.

These are my training settings:

```
D:/(filepath)/object detection> python model_main_tf2.py --pipeline_config_path D:\(filepath)\pre-trained-models\ssd_mobilenet_v2_320x320_coco17_tpu-8\pipeline.config --model_dir D:\(filepath)\models\model_out
```

with the pipeline config:

```
model {
  ssd {
    num_classes: 9
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
    feature_extractor {
      type: "ssd_mobilenet_v2_keras"
      depth_multiplier: 1.0
      min_depth: 16
      conv_hyperparams {
        regularizer {
          l2_regularizer {
            weight: 0.005
          }
        }
        initializer {
          truncated_normal_initializer {
            mean: 0.0
            stddev: 0.029
          }
        }
        activation: RELU_6
        batch_norm {
          decay: 0.9700000286102295
          center: true
          scale: true
          epsilon: 0.0010000000474974513
          train: true
        }
      }
      override_base_feature_extractor_hyperparams: true
    }
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    box_predictor {
      convolutional_box_predictor {
        conv_hyperparams {
          regularizer {
            l2_regularizer {
              weight: 0.00004
            }
          }
          initializer {
            random_normal_initializer {
              mean: 0.0
              stddev: 0.03
            }
          }
          activation: RELU_6
          batch_norm {
            decay: 0.9
            center: true
            scale: true
            epsilon: 1.0
            train: true
          }
        }
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.8
        kernel_size: 1
        box_code_size: 4
        apply_sigmoid_to_scores: false
        class_prediction_bias_init: 0.2
      }
    }
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.2
        max_scale: 0.95
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.3333
      }
    }
    post_processing {
      batch_non_max_suppression {
        score_threshold: 0.5
        iou_threshold: 0.6000000238418579
        max_detections_per_class: 100
        max_total_detections: 100
        use_static_shapes: false
      }
      score_converter: SIGMOID
    }
    normalize_loss_by_num_matches: true
    loss {
      localization_loss {
        weighted_smooth_l1 {
          delta: 1.0
        }
      }
      classification_loss {
        weighted_sigmoid_focal {
          gamma: 2.0
          alpha: 0.75
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    encode_background_as_zeros: true
    normalize_loc_loss_by_codesize: true
    inplace_batchnorm_update: true
    freeze_batchnorm: false
  }
}
train_config {
  batch_size: 4
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_rgb_to_gray {
    }
  }
  sync_replicas: true
  optimizer {
    momentum_optimizer {
      learning_rate {
        cosine_decay_learning_rate {
          learning_rate_base: 0.008
          total_steps: 50000
          warmup_learning_rate: 0.002
          warmup_steps: 1000
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  fine_tune_checkpoint: "D:/Visual_Studio/TensorflowTake3/models/research/object_detection/downloaded_ssd_mobilenet_v2_320x320/checkpoint/ckpt-0"
  num_steps: 30000
  startup_delay_steps: 0.0
  replicas_to_aggregate: 8
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
  fine_tune_checkpoint_type: "detection"
  fine_tune_checkpoint_version: V2
}
train_input_reader {
  label_map_path: "D:/Visual_Studio/TensorflowTake3/label_map/label_map.pbtxt"
  tf_record_input_reader {
    input_path: "D:/Visual_Studio/TensorflowTake3/record_file/tf_cropped/tf_cropped_train.record"
  }
}
eval_config {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
}
eval_input_reader {
  label_map_path: "D:/Visual_Studio/TensorflowTake3/label_map/label_map.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "D:/Visual_Studio/TensorflowTake3/record_file/tf_cropped/tf_cropped_test.record"
  }
}
```

(I hope the config is readable; I had some problems with the formatting.)

I have tried a number of different versions of this file. As I have a relatively small number of pictures, and the "simplicity" of the pictures is quite high, I have tried num_steps from 1,000 up to 50,000, with pretty much the same result.

I have changed the data_augmentation_options to random_horizontal_flip in the hope of not altering the pictures too much, as the final environment is very static and the model might benefit from that.

The model will run on an OAK-D camera, used indoors without much natural light, so the color variation the final model sees will be minimal.

After training, this script is used to export the model:

```
python .\exporter_main_v2.py --input_type image_tensor --pipeline_config_path D:\(filepath)\ssd_mobilenet_v2_320x320_coco17_tpu-8\pipeline.config --trained_checkpoint_dir D:\(filepath)\model_out --output_directory D:\(filepath)\my_model
```
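
To check that the export itself is sound, the SavedModel can be run directly on a test image before any conversion; a minimal sketch (the paths are placeholders):

```
import tensorflow as tf

# Load the SavedModel produced by exporter_main_v2.py
detect_fn = tf.saved_model.load("D:/(filepath)/my_model/saved_model")

# input_type image_tensor expects uint8 with shape [1, H, W, 3]
image = tf.io.decode_jpeg(tf.io.read_file("test.jpg"))
detections = detect_fn(tf.expand_dims(image, 0))

# Print the five highest-scoring detections
print(detections["detection_scores"][0][:5].numpy())
print(detections["detection_classes"][0][:5].numpy())
```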

This is used to create the IR:

```
python mo_tf.py --saved_model_dir D:\(filepath)\saved_model\ --data_type=FP16 --scale_values [255,255,255] --output_dir D:\(filepath)\20220324_night --tensorflow_use_custom_operations_config C:\Users\(filepath)\envs\openvino\Lib\site-packages\mo\extensions\front\tf\ssd_support_api_v2.4.json --tensorflow_object_detection_api_pipeline_config D:\(filepath)\model\pipeline.config --log_level DEBUG --tensorboard_logdir D:\(filepath)\logdir
```
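
To narrow down whether a problem comes from this conversion or from the later blob step, the FP16 IR can be run on the CPU first. A minimal sketch, assuming the OpenVINO 2021.x Inference Engine Python API (the file names are placeholders):

```
import cv2
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")
exec_net = ie.load_network(network=net, device_name="CPU")

input_name = next(iter(net.input_info))
n, c, h, w = net.input_info[input_name].input_data.shape

# Resize to the network input size and reorder HWC -> NCHW
frame = cv2.imread("test.jpg")
blob = cv2.resize(frame, (w, h)).transpose(2, 0, 1)[np.newaxis, ...]

# SSD-style IRs typically return a [1, 1, N, 7] DetectionOutput tensor
result = exec_net.infer({input_name: blob})
print(result)
```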

For the final conversion I use the online blob converter at http://blobconverter.luxonis.com/ with model source "OpenVINO model", MyriadX compile params "-ip U8" (the default), and 6 shaves.
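
The same conversion can also be scripted with Luxonis' blobconverter package, which calls the same service; a minimal sketch (the paths are placeholders, and "-ip U8" is the converter's default input precision):

```
import blobconverter

# Compile the OpenVINO IR (xml + bin) into a MyriadX blob for 6 shaves
blob_path = blobconverter.from_openvino(
    xml="D:/(filepath)/20220324_night/saved_model.xml",
    bin="D:/(filepath)/20220324_night/saved_model.bin",
    data_type="FP16",
    shaves=6,
)
print(blob_path)
```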

When I test the model with the OAK-D I use a script from the DepthAI docs: https://docs.luxonis.com/projects/api/en/develop/samples/MobileNet/video_mobilenet/ This runs the detection network against video that I have captured of the final environment.
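
For context, the part of that DepthAI pipeline that matters here is only a few lines; a minimal sketch, assuming the DepthAI Python API (the blob path is a placeholder):

```
import depthai as dai

pipeline = dai.Pipeline()

# MobileNet-SSD style decoding of the blob's output happens on-device in this node
nn = pipeline.create(dai.node.MobileNetDetectionNetwork)
nn.setBlobPath("D:/(filepath)/model.blob")
nn.setConfidenceThreshold(0.5)  # the threshold discussed below
```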

The problem I get when running it: most often there are bounding boxes all over the screen, not "focusing" on anything in particular; they are just everywhere. When I vary n.setConfidenceThreshold(0.5), there is a tipping point around 0.2 that either gives me no bounding boxes at all or too many bounding boxes not focusing on anything.

When I tried 50,000 steps I got a row of bounding boxes at the bottom of the screen (in what is "dead space" for the model) and no reaction at all to the cars coming into the frame.

As I feel I'm stuck, not knowing what to do, all ideas are welcome. I'm trying to research the issue, but as I don't know what the problem might be, it's tricky to know where to start.

The original photos captured for labeling are in 16:9 aspect ratio. Would it be possible to keep this ratio throughout the process? If so, the positions of the different parking spots would stay intact through training, hopefully enhancing the model's ability to learn where the different parking spots (and the road) are located within the frame.
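
From what I can find, the TF2 Object Detection API does offer a resizer that preserves aspect ratio; whether it works well with this particular SSD setup is something I would still have to verify, but the image_resizer block in the pipeline config would look roughly like this (the dimensions are my assumptions):

```
image_resizer {
  keep_aspect_ratio_resizer {
    min_dimension: 320
    max_dimension: 640
    pad_to_max_dimension: true
  }
}
```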

As mentioned in the edit above, I have found a way to test the model on images after training, and this gives a great result.

My idea in the long run is to get a working model so that I have something to experiment on and learn more from.

I have read somewhere that someone with the same problem solved it by changing the label_map used for the .record file. As I understand it, the .record file more or less consists of serialized examples (tensors); would it be possible to inspect this file to check the label order, or have I understood this wrong?
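
If it is inspectable, I guess something like this sketch would show the class names and IDs written into the records (the path is a placeholder):

```
import tensorflow as tf

dataset = tf.data.TFRecordDataset("D:/(filepath)/tf_cropped_train.record")
for raw in dataset.take(5):
    example = tf.train.Example()
    example.ParseFromString(raw.numpy())
    feats = example.features.feature
    # Standard TF OD API feature keys for class text and numeric label
    texts = feats["image/object/class/text"].bytes_list.value
    ids = feats["image/object/class/label"].int64_list.value
    print([t.decode() for t in texts], list(ids))
```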

As mentioned above, all advice/ideas are welcome, and keep in mind I'm new to this, which means I could have made a simple mistake somewhere.

Sorry for the long post, and please let me know if there is anything I should elaborate on or explain further.

Best,

Martin

Upvotes: 1

Views: 572

Answers (1)

Iffa_Intel

Reputation: 104

From an OpenVINO perspective you may refer to this Object Detection Python Demo, since that demo is rather similar to your aim. Models that are similar/close to this kind of implementation are also available.

The thing you need to figure out is how you are going to train the model to differentiate whether a spot is taken or not. You'll need to decide on the stimuli that differentiate those; for example, in IoT, an infrared sensor in the parking lot determines an empty/occupied spot through a threshold value. Meanwhile, in AI, it's up to you as the developer to decide how to determine that concept. This might give you some ideas (they even share their source code).

Another thing to note: the OpenCV OAK-D is not part of the OpenVINO Toolkit. Hence, for anything specific to it, you'll need to reach out to their discussion forum or community.

Upvotes: 0
