Reputation: 3363
I am using MXNet to finetune a ResNet model on the Caltech 256 dataset from the following example:
https://mxnet.incubator.apache.org/how_to/finetune.html
I am primarily doing it for a POC to test distributed training (which I'll later use in my actual project).
First I ran this example on a single machine with 2 GPUs for 8 epochs. It took around 20 minutes and the final validation accuracy was 0.809072.
Then I ran it on 2 machines (identical, each with 2 GPUs) in a distributed setting and partitioned the training data in half between the two machines (using num_parts and part_index).
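For reference, the per-worker data loading looks roughly like this (a simplified sketch of the iterator setup; the .rec path, image shape and batch size are placeholders rather than my exact values):

import mxnet as mx

# Each worker reads only its own shard of the RecordIO file:
# num_parts is the total number of workers, part_index is this worker's index.
train_iter = mx.io.ImageRecordIter(
    path_imgrec='caltech-256-60-train.rec',  # placeholder path
    data_shape=(3, 224, 224),
    batch_size=16,
    shuffle=True,
    num_parts=2,     # two machines in total
    part_index=0)    # 0 on the first machine, 1 on the second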
8 epochs took only 10 minutes, but the final validation accuracy was only 0.772847 (the higher of the two workers). Even when I used 16 epochs, I was only able to achieve 0.797006.
So my question is: is this normal? I primarily want to use distributed training to reduce training time. But if it takes twice as many epochs or more to achieve the same accuracy, then what's the advantage? Maybe I am missing something.
I can post my code and run command if required.
Thanks
EDIT
Some more info to help with the answer:
MXNet version: 0.11.0
Topology: 2 workers (each on a separate machine)
Code: https://gist.github.com/reactivefuture/2a1f9dcd3b27c0fe8215b4e3d25056ce
Command to start:
python3 mxnet/tools/launch.py -n 2 -H hosts --sync-dst-dir /tmp/mxnet python3 training.py --kv-store dist_sync --gpus 0,1
I have used a hacky way to do the partitioning (based on IP addresses) since I couldn't get kv.num_workers and kv.rank to work.
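For completeness, the non-hacky version I was trying to get working looks roughly like this (a sketch; it assumes the kvstore is created before the data iterators, with its values then fed into the num_parts/part_index arguments shown above):

import mxnet as mx

# What I wanted to use instead of the IP-based hack:
kv = mx.kvstore.create('dist_sync')
num_parts = kv.num_workers   # total number of workers in the cluster
part_index = kv.rank         # this worker's 0-based index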
Upvotes: 0
Views: 264
Reputation: 918
So my question is: is this normal? I primarily want to use distributed training to reduce training time. But if it takes twice as many epochs or more to achieve the same accuracy, then what's the advantage?
No, it is not normal. Distributed training should indeed be used to speed up the training process, not to slow it down. However, there are many ways to do it wrong.
Based on the provided data, it feels like the workers are still running in single-machine ('device') mode, or maybe the kv_store is created incorrectly, so each worker just trains the model by itself. In that case you should see the validation result after 16 epochs be close to that of the single machine after 8 epochs (simply because in the cluster you are splitting the data in half). In your case it is 0.797006 vs 0.809072. Depending on how many experiments you have executed, these numbers might be treated as equal. I would focus my investigation on how the cluster is bootstrapped.
If you need to dive deeper into how to create a kv_store (or what it is) and how to use it with distributed training, please see this article.
In general, in order to get a better answer, in the future please provide at least basic information such as the MXNet version, the topology, the code, and the command used to start training.
EDIT
Even though the call that starts the training looks correct:
python3 mxnet/tools/launch.py -n 2 -H hosts --sync-dst-dir /tmp/mxnet python3 training.py --kv-store dist_sync --gpus 0,1
There is at least one problem in training.py itself. If you look here, it actually does not respect the type of kv-store passed in as an input argument and just uses 'device'. Therefore all workers are training the model separately (and not as a cluster). I believe fixing this one line should help.
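Roughly, the fix would look like this (a sketch only; the names args, mod, train_iter and val_iter are assumed to correspond to the argument parser, module and iterators in your script):

import mxnet as mx

# Create the kvstore from the --kv-store command-line argument
# ('dist_sync' in your case) instead of the hard-coded 'device' ...
kv = mx.kvstore.create(args.kv_store)

# ... and pass it to the module's fit() call so gradients are synchronized:
mod.fit(train_iter, eval_data=val_iter, kvstore=kv, num_epoch=8)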
I would again advise reading the article to familiarize yourself with how an MXNet cluster works. Such problems can be easily spotted by analyzing the debug logs and observing that no kv-store is created, and therefore the cluster is not training anything (only stand-alone machines are doing something).
Upvotes: 2