edn

Reputation: 2183

tensorflow gpu is only running on CPU

I installed Anaconda Navigator on Windows 10 and all necessary Nvidia/CUDA packages, created a new environment called tensorflow-gpu-env, updated the PATH information, etc. When I run a model (built using tensorflow.keras), I see that CPU utilization increases significantly, GPU utilization is 0%, and the model just does not train.

I ran a couple of tests to check how things look:

print(tf.test.is_built_with_cuda())
True

The above output ('True') looks correct.
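Another check one could add here (just a sketch, assuming the TF 1.x API) is tf.test.is_gpu_available(), which should print True when a CUDA device is visible and usable:

import tensorflow as tf

# Assumption: TF 1.x; True means TensorFlow can actually create ops on a GPU
print(tf.test.is_gpu_available())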

Another try:

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

Output:

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 1634313269296444741
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 1478485606
locality {
  bus_id: 1
  links {
  }
}
incarnation: 16493618810057409699
physical_device_desc: "device: 0, name: GeForce 940MX, pci bus id: 0000:01:00.0, compute capability: 5.0"
]

So far so good... Later in my code, I start the training with the following code:

history = merged_model.fit_generator(generator=train_generator,
                                     epochs=60,
                                     verbose=2,
                                     callbacks=[reduce_lr_on_plateau],
                                     validation_data=val_generator,
                                     use_multiprocessing=True,
                                     max_queue_size=50,
                                     workers=3)

I also tried to run the training as follows:

with tf.device('/gpu:0'):
    history = merged_model.fit_generator(generator=train_generator,
                                         epochs=60,
                                         verbose=2,
                                         callbacks=[reduce_lr_on_plateau],
                                         validation_data=val_generator,
                                         use_multiprocessing=True,
                                         max_queue_size=50,
                                         workers=3)

No matter how I start it, the training never begins; I keep seeing high CPU utilization and 0% GPU utilization.

Why is my tensorflow-gpu installation only using the CPU? I have spent hours on this with literally no progress.
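One diagnostic that could still be tried (a sketch, assuming the TF 1.x session API) is enabling device-placement logging, so the console shows which device each op actually lands on:

import tensorflow as tf

# Assumption: TF 1.x; log every op's device assignment to the console
config = tf.ConfigProto(log_device_placement=True)
sess = tf.Session(config=config)
tf.keras.backend.set_session(sess)  # make tf.keras use this session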

ADDENDUM

When I run conda list on the console, I see the following regarding tensorflow:

tensorflow-base           1.11.0          gpu_py36h6e53903_0
tensorflow-gpu            1.11.0                    <pip>

What is this tensorflow-base? Can it cause a problem? Before installing tensorflow-gpu, I made sure that I uninstalled tensorflow and tensorflow-gpu using both conda and pip; I then installed tensorflow-gpu using pip. I am not sure whether this tensorflow-base came with my tensorflow-gpu installation.
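A quick way to see which build the interpreter actually picks up inside the environment (just a diagnostic sketch) is to print the version and file path of the imported package:

import tensorflow as tf

# Shows which installation is actually being imported
print(tf.__version__)  # e.g. 1.11.0
print(tf.__file__)     # path reveals whether it is the pip or the conda package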

ADDENDUM 2

It looks like tensorflow-base was part of conda, because I could uninstall it with conda uninstall tensorflow-base. I still have the tensorflow-gpu installation in place, but now I cannot import tensorflow anymore; it says "No module named tensorflow". It looks like my conda environment is not seeing my tensorflow-gpu installation. I am quite confused at the moment.

Upvotes: 3

Views: 2360

Answers (2)

edn

Reputation: 2183

@Smokrow, thank you for your answer above. It appears that Keras has problems with multiprocessing on Windows.

history = merged_model.fit_generator(generator=train_generator,
                                     epochs=60,
                                     verbose=2,
                                     callbacks=[reduce_lr_on_plateau],
                                     validation_data=val_generator,
                                     use_multiprocessing=True,
                                     max_queue_size=50,
                                     workers=3)

The piece of code above causes Keras to hang, and literally no progress is seen. If you are running your code on Windows, use_multiprocessing needs to be set to False; otherwise it does not work. Interestingly, workers can still be set to a number greater than one, and it still gives a performance benefit. I have difficulty understanding what is really happening in the background, but the speed-up is there. The following piece of code made it work.

history = merged_model.fit_generator(generator=train_generator,
                                     epochs=60,
                                     verbose=2,
                                     callbacks=[reduce_lr_on_plateau],
                                     validation_data=val_generator,
                                     use_multiprocessing=False,  # CHANGED
                                     max_queue_size=50,
                                     workers=3)

Upvotes: 1

Smokrow

Reputation: 241

Depending on the size of your network, it could be that your CPU is busy loading data most of the time.

Since you are using Python generators, most of your time will be spent in Python code opening your files. The generator is probably bottlenecking your pipeline.

Once a batch is loaded, it is probably evaluated almost instantly on the GPU, resulting in close to 0% GPU utilization, since your GPU spends most of its time waiting for new data. You could try using TensorFlow's tf.data API; TFRecord files are extremely fast to load. Take a look at this article.
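A minimal sketch of such a pipeline (assuming TF 1.x, TFRecords already written to a hypothetical train.tfrecords, and a parse function adapted to whatever features you serialized) could look like this:

import tensorflow as tf

# Hypothetical feature layout; adapt to the features you actually serialized
def parse_example(serialized):
    features = tf.parse_single_example(
        serialized,
        features={
            "image": tf.FixedLenFeature([], tf.string),
            "label": tf.FixedLenFeature([], tf.int64),
        })
    image = tf.decode_raw(features["image"], tf.uint8)
    image = tf.cast(image, tf.float32) / 255.0
    return image, features["label"]

dataset = (tf.data.TFRecordDataset("train.tfrecords")  # assumed file name
           .map(parse_example, num_parallel_calls=4)   # decode in parallel, outside the Python generator
           .shuffle(buffer_size=1000)
           .batch(32)
           .repeat()
           .prefetch(1))                               # keep a batch ready while the GPU computes

# tf.keras in TF 1.11 can consume the dataset directly, e.g.:
# merged_model.fit(dataset, epochs=60, steps_per_epoch=steps_per_epoch)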

Upvotes: 1
