Getting -NaN during Darknet training: what am I doing wrong?

Question

I want to train YOLOv3 to detect humans in aerial pictures. I'm using the VisDrone Object Detection in Images dataset: github.com/VisDrone/VisDrone-Dataset

I wrote a script that converts the labels to Darknet format so that I can train according to pjreddie's "Training YOLO on COCO" instructions. I double-checked that my converted labels match the objects correctly, and they do. I also created a proper coco.names file according to the label descriptions in the VisDrone2018-DET-toolkit on GitHub. I created the trainvalno5k.txt file by running

python 5kGenerator.py > trainvalno5k.txt

5kGenerator.py:

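import os

# Print the absolute path of every file in images/, one per line,
# in sorted order so the generated list is reproducible.
for filename in sorted(os.listdir('images')):
    print(os.path.abspath(os.path.join('images', filename)))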
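
The label-conversion script itself is not shown here, so the following is only a minimal sketch of what such a conversion might look like. It assumes each VisDrone annotation line has the comma-separated fields bbox_left, bbox_top, bbox_width, bbox_height, score, object_category, truncation, occlusion (the order described in the VisDrone toolkit), and that the category number is used directly as the Darknet class index, which would be consistent with classes= 12. The function name visdrone_to_darknet is hypothetical.

```python
def visdrone_to_darknet(annotation_line, image_width, image_height):
    """Convert one VisDrone annotation line to a Darknet label line.

    Darknet expects: <class> <x_center> <y_center> <width> <height>,
    with all four box values normalised to the range 0..1.
    Returns None for boxes that have no area.
    """
    # Assumed field order: left, top, width, height, score, category, ...
    fields = annotation_line.strip().rstrip(",").split(",")
    left, top, width, height = (float(value) for value in fields[:4])
    category = int(fields[5])  # assumption: used as the class index as-is

    if width <= 0 or height <= 0:
        return None

    x_center = (left + width / 2) / image_width
    y_center = (top + height / 2) / image_height
    return "{} {:.6f} {:.6f} {:.6f} {:.6f}".format(
        category,
        x_center,
        y_center,
        width / image_width,
        height / image_height,
    )


if __name__ == "__main__":
    # A 50x100 box at (100, 200) in a 1000x1000 image, category 1.
    print(visdrone_to_darknet("100,200,50,100,1,1,0,0", 1000, 1000))
```

A label line whose values fall outside 0..1 usually points to a wrong image size being passed in, so that is worth checking when validating converted labels.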

I modified the coco.data file; this is the result:

classes= 12
train  = /mnt/d/Olaf/Documents/Python/VisDrone2019-DET-train/trainvalno5k.txt
#valid  = /mnt/d/Olaf/Documents/Python/VisDrone2019-DET-train/5k.txt
#valid = data/coco_val_5k.list
names  = /mnt/d/Olaf/Documents/Python/VisDrone2019-DET-train/coco.names
backup = backup
#eval=coco

I commented valid out because, as far as I understand, it is only used for checking results, so the validation set is irrelevant for training (I didn't bother to create one).

When I run ./darknet detector train cfg/coco.data cfg/yolov3.cfg darknet53.conv.74, everything loads correctly and training starts, but every few lines I get -nan messages. I have no idea why, or whether they affect the end result. Example:

Loading weights from darknet53.conv.74...Done!
Learning Rate: 0.001, Momentum: 0.9, Decay: 0.0005
Resizing
416
Loaded: 1.122782 seconds
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.428162, .5R: -nan, .75R: -nan,  count: 0
Region 94 Avg IOU: 0.409795, Class: 0.690346, Obj: 0.091164, No Obj: 0.519810, .5R: 0.000000, .75R: 0.000000,  count: 1
Region 106 Avg IOU: 0.157575, Class: 0.532119, Obj: 0.333807, No Obj: 0.417611, .5R: 0.045685, .75R: 0.000000,  count: 197
Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.427261, .5R: -nan, .75R: -nan,  count: 0
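
In the log above, every line that shows -nan also shows count: 0. A plausible reading, assuming Darknet prints each average as a running sum divided by count, is that no ground-truth box was assigned to that detection layer in that batch, so the average is 0/0. Region 82 is the coarsest of the three YOLOv3 detection layers and uses the largest anchors, while small aerial objects mostly land on Region 106. The sketch below checks that pattern on log lines; the function names are hypothetical and the regular expression is written only against the line format shown here.

```python
import math
import re

# Matches the per-layer status lines in the format shown in the log above.
REGION_RE = re.compile(
    r"Region (?P<layer>\d+) Avg IOU: (?P<iou>[^,\s]+),.*count: (?P<count>\d+)"
)


def parse_region_line(line):
    """Return (layer, avg_iou, count) for a Region line, else None."""
    match = REGION_RE.search(line)
    if match is None:
        return None
    # float() accepts both "nan" and "-nan".
    return int(match["layer"]), float(match["iou"]), int(match["count"])


def count_zero_explains_nan(line):
    """True if the line reports a NaN average together with count 0."""
    parsed = parse_region_line(line)
    if parsed is None:
        return False
    _, avg_iou, count = parsed
    return math.isnan(avg_iou) and count == 0


if __name__ == "__main__":
    sample = (
        "Region 82 Avg IOU: -nan, Class: -nan, Obj: -nan, "
        "No Obj: 0.428162, .5R: -nan, .75R: -nan,  count: 0"
    )
    print(count_zero_explains_nan(sample))
```

If a line ever showed NaN together with a non-zero count, or NaN in the overall loss, that would be a different situation and this explanation would not cover it.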

It's pretty slow because I'm testing this on a CPU; the proper training will be done on an Nvidia Quadro.

Can you please explain this behaviour, and what can I do to fix the -nan problem?

P.S. I'm using the Ubuntu terminal on Windows 10; I don't know if that's important.

BarzanHayati · Accepted Answer

It's better to use the AlexeyAB repository for training.
You should use a validation or test set to evaluate the trained network on your data.
I have trained on a 26-class dataset and I ignored the 5k file; you have 12 classes.
For the NaN values, it's better to decrease the learning rate at the start of training and then increase it.
You can train your network on Windows or Linux; it doesn't matter.
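
The advice to start with a lower learning rate and then raise it is usually called warm-up. As a rough sketch of the idea, the function below ramps the rate polynomially up to the base value of 0.001 shown in the training log. The warm-up length of 1000 batches and the exponent of 4 are assumptions (they correspond to what the burn_in setting in the cfg file is believed to do by default), and the function name is hypothetical.

```python
def warmup_learning_rate(batch_num, base_lr=0.001, burn_in=1000, power=4.0):
    """Learning rate for a given batch under a polynomial warm-up.

    During the first `burn_in` batches the rate grows from 0 towards
    `base_lr`; afterwards it stays at `base_lr`. Later schedule steps
    (for example step decay) are deliberately left out of this sketch.
    """
    if batch_num < burn_in:
        return base_lr * (batch_num / burn_in) ** power
    return base_lr


if __name__ == "__main__":
    for batch in (0, 250, 500, 750, 1000, 2000):
        print(batch, warmup_learning_rate(batch))
```

With these assumed values the rate is still only 1/16 of the base rate halfway through the warm-up, which keeps the early updates small while the randomly initialised detection layers settle.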
