lemon

Reputation: 747

Why is Keras LSTM on CPU three times faster than GPU?

I use this notebook from Kaggle to run an LSTM neural network.

I started training the neural network and saw that it was too slow on the GPU: almost three times slower than training on the CPU.

After that I looked for an answer in this question on Stack Overflow and switched to CuDNNLSTM (which runs only on the GPU) instead of LSTM.

As a result, GPU training dropped to just 1 minute per epoch, but the accuracy of the model decreased by 3%.
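For context, the change amounts to something like the following (a minimal sketch with placeholder sizes and names, not the actual notebook code):

from keras.models import Sequential
from keras.layers import LSTM, CuDNNLSTM, Dense

# placeholder shapes standing in for the real Kaggle data
timesteps, features = 100, 1

model = Sequential()
# original layer, runs on CPU or GPU:
# model.add(LSTM(128, input_shape=(timesteps, features)))
# replacement, cuDNN-backed and GPU-only:
model.add(CuDNNLSTM(128, input_shape=(timesteps, features)))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])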

Questions:

1) Does anybody know why the GPU works slower than the CPU with the classic LSTM layer? I do not understand why this happens.

2) Why does training become much faster, and the accuracy of the model decrease, when I use CuDNNLSTM instead of LSTM?

P.S.:

My CPU: Intel Core i7-7700 Processor (8M Cache, up to 4.20 GHz)

My GPU: nVidia GeForce GTX 1050 Ti (4 GB)

Upvotes: 10

Views: 8102

Answers (4)

Ben Amadi

Reputation: 1

Running a model on a GPU is a complex process. It is not just a matter of wrapping your code in a with 'GPU' or with 'CPU' block and considering it done. First you should describe in detail how you implemented the GPU path; then it will become clear why and where things went wrong.
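For reference, explicit device placement in TensorFlow 1.x looks roughly like this (a sketch only; the point above is that pinning ops to a device says nothing about how efficiently they run there):

import tensorflow as tf

# pin the ops to the first GPU; this alone does not guarantee fast execution
with tf.device('/gpu:0'):
    a = tf.random_normal((1000, 1000))
    b = tf.matmul(a, a)

with tf.Session() as sess:
    print(sess.run(b).shape)  # (1000, 1000)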

Upvotes: -1

ericbdevil

Reputation: 71

I had a similar problem today and found two things that may be helpful to others (this is a regression problem on a data set with ~2.1MM rows, running on a machine with 4 P100 GPUs):

  1. Using the CuDNNLSTM layer instead of the LSTM layer on a GPU machine reduced the fit time from ~13500 seconds to ~400 seconds per epoch.
  2. Increasing the batch size (~500 to ~4700) reduced it to ~130 seconds per epoch.

Increasing the batch size also increased the loss and val loss, so you'll need to make a decision about the trade-offs you want to make.
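For what it's worth, both changes are one-liners; roughly (a sketch with made-up shapes and names, not the actual training script):

import numpy as np
from keras.models import Sequential
from keras.layers import CuDNNLSTM, Dense

# dummy stand-ins for the real ~2.1MM-row data set
timesteps, features = 20, 8
X_train = np.random.rand(100000, timesteps, features)
y_train = np.random.rand(100000, 1)

model = Sequential()
model.add(CuDNNLSTM(64, input_shape=(timesteps, features)))  # was LSTM(64, ...)
model.add(Dense(1))
model.compile(loss='mse', optimizer='adam')

# a larger batch keeps the GPUs busy and cuts the time per epoch,
# at the cost of somewhat higher loss and val_loss
model.fit(X_train, y_train, epochs=10, batch_size=4700, validation_split=0.1)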

Upvotes: 5

Ashok Kumar Jayaraman

Reputation: 3095

In Keras, the fast LSTM implementation is the one backed by CuDNN:

from keras.layers import CuDNNLSTM  # Keras 2.x; input_shape is (timesteps, features), not the number of samples
model.add(CuDNNLSTM(units, input_shape=(X_train.shape[1], X_train.shape[2]), return_sequences=True))

It can only be run on the GPU with the TensorFlow backend.

Upvotes: 2

Richard

Reputation: 61459

My guess is that it's just a different, better implementation and, if the implementation is different, you shouldn't expect identical results.

In general, efficiently implementing an algorithm on a GPU is hard, and getting maximum performance requires architecture-specific implementations. Therefore, it wouldn't be surprising if an implementation specific to Nvidia's GPUs had enhanced performance versus a general implementation for GPUs. It also wouldn't be surprising that Nvidia would sink significantly more resources into accelerating their code for their GPUs than would a team working on a general LSTM implementation.

The other possibility is that the data type used on the backend has changed from double- to single- or even half-precision float. The smaller data types mean you can crunch more numbers faster at the cost of accuracy. For NN applications this is often acceptable because no individual number needs to be especially accurate for the net to produce acceptable results.
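If you want to see the precision trade-off for yourself, Keras lets you change the backend float type (a sketch; whether cuDNN actually drops precision internally is a separate question):

from keras import backend as K

# Keras defaults to float32; float16 halves memory traffic and can be
# faster on GPUs that support it, at some cost in numerical accuracy
K.set_floatx('float16')
print(K.floatx())  # 'float16'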

Upvotes: 9
