Reputation: 2621
I am trying to run the Keras MNIST example using tensorflow-gpu with a GeForce RTX 2080. My environment is Anaconda on a Linux system.
I am running the unmodified example from a command-line Python session. I get the following output:
Using TensorFlow backend.
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce RTX 2080, pci bus id: 0000:01:00.0, compute capability: 7.5
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
conv2d_1/random_uniform/RandomUniform: (RandomUniform):
/job:localhost/replica:0/task:0/device:GPU:0
conv2d_1/random_uniform/sub: (Sub):
/job:localhost/replica:0/task:0/device:GPU:0
conv2d_1/random_uniform/mul: (Mul):
/job:localhost/replica:0/task:0/device:GPU:0
conv2d_1/random_uniform: (Add):
/job:localhost/replica:0/task:0/device:GPU:0
[...]
The last lines I receive are:
training/Adadelta/Const_31: (Const): /job:localhost/replica:0/task:0/device:GPU:0
training/Adadelta/mul_46/x: (Const): /job:localhost/replica:0/task:0/device:GPU:0
training/Adadelta/mul_47/x: (Const): /job:localhost/replica:0/task:0/device:GPU:0
Segmentation fault (core dumped)
From reading around I assumed this might be a memory problem and added these lines to prevent the GPU from running out of memory:
import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto(log_device_placement=True)
config.gpu_options.per_process_gpu_memory_fraction = 0.3  # cap this process at 30% of GPU memory
K.tensorflow_backend.set_session(tf.Session(config=config))
Checking with the nvidia-smi tool (watch -n1 nvidia-smi), I can confirm that the GPU is actually used, as shown by the following output (in this run no per_process_gpu_memory_fraction was set to 1):
I suspect a version incompatibility somewhere between CUDA, Keras and TensorFlow to be the issue, but I don't know how to debug this.
What debugging measures are available to get to the bottom of this? What other issues might be the reason for this segfault?
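One low-cost debugging measure for a crash like this is Python's standard faulthandler module, which dumps the Python traceback of every thread when the process receives a fatal signal such as SIGSEGV. The sketch below is only an illustration: the helper name enable_crash_traceback and its log_path parameter are made up here, and the dump shows Python frames only, not what happened inside native CUDA or cuDNN code.

```python
import faulthandler
import sys


def enable_crash_traceback(log_path=None):
    """Dump the Python traceback of all threads when a fatal signal arrives.

    With log_path=None the dump goes to stderr. Otherwise it is appended to
    the given file, which has to stay open for as long as the handler is
    installed, so the open file object is returned to the caller.
    """
    stream = sys.stderr if log_path is None else open(log_path, "a")
    faulthandler.enable(file=stream, all_threads=True)
    return stream
```

Calling enable_crash_traceback() at the very top of the example script, before Keras is imported, should at least show which Python call was active when the segfault hit. Starting the interpreter with python -X faulthandler has the same effect without touching the script.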
EDIT: I experimented further and replacing the model with this code works fine:
model = keras.Sequential([
    keras.layers.Flatten(input_shape=input_shape),
    keras.layers.Dense(128, activation=tf.nn.relu),
    keras.layers.Dense(10, activation=tf.nn.softmax)
])
However, once I introduce a convolution layer like so:
model = keras.Sequential([
    keras.layers.Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape),
    # keras.layers.Flatten(input_shape=input_shape),
    keras.layers.Flatten(),
    keras.layers.Dense(128, activation=tf.nn.relu),
    keras.layers.Dense(10, activation=tf.nn.softmax)
])
then I again get the aforementioned segfault.
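This kind of comparison (dense-only model works, model with Conv2D crashes) can be automated by running each variant in its own interpreter, so a segfault kills only the child process and the parent can report which variant died. The following is a rough sketch for a POSIX system. The names run_isolated, died_from_segfault, find_crashing_variants and CANDIDATES are hypothetical, and the two candidate snippets are placeholders: the second one merely simulates a crash by sending SIGSEGV to itself and would be replaced by the real model-building code.

```python
import signal
import subprocess
import sys


def run_isolated(snippet, timeout=600):
    """Run a Python snippet in a fresh interpreter and return the CompletedProcess."""
    return subprocess.run(
        [sys.executable, "-c", snippet],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=timeout,
    )


def died_from_segfault(result):
    """On POSIX, a return code of -N means the child was killed by signal N."""
    return result.returncode == -signal.SIGSEGV


# Placeholder snippets (assumptions for illustration only). In real use each
# entry would build and train one model variant from the question.
CANDIDATES = {
    "dense_only": "print('dense model trained')",
    "with_conv2d": "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)",
}


def find_crashing_variants(candidates):
    """Return the names of all variants whose child process segfaulted."""
    return [
        name
        for name, snippet in candidates.items()
        if died_from_segfault(run_isolated(snippet))
    ]


if __name__ == "__main__":
    print("Variants that segfaulted:", find_crashing_variants(CANDIDATES))
```

Because every variant starts from a clean process, a crash in one of them cannot leave the GPU session of the next one in a bad state.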
All packages have been installed through Anaconda. I have installed
EDIT: I tried the same code in a non-Anaconda environment and it works flawlessly. I would prefer to use Anaconda, though, to avoid system updates breaking things.
Upvotes: 9
Views: 6379
Reputation: 248
I had the exact same problem on a system very similar to Francois's, but with an RTX 2070, on which I could reliably reproduce the segmentation fault whenever the conv2d function was executed on the GPU. My setting:
I finally solved it by building tensorflow from source into a new conda environment. For a fantastic guide see e.g. the following link: https://gist.github.com/Brainiarc7/6d6c3f23ea057775b72c52817759b25c
This is basically like any other build-tensorflow-from-source guide and consisted in my case of the following steps:
./configure
bazel build command (see link for details)
Some minor issues came up during the build, one of which was solved by installing 3 packages manually, using:
pip install keras_applications==1.0.4 --no-deps
pip install keras_preprocessing==1.0.2 --no-deps
pip install h5py==2.8.0
which I found out using this answer here: Error Compiling Tensorflow From Source - No module named 'keras_applications'
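To confirm that the three pinned packages actually ended up in the conda environment used for the build, something along these lines could be run inside it. This is only a sketch: PINS and check_pins are names invented here, and importlib.metadata needs Python 3.8 or newer, which is an assumption about the environment and not something stated in this thread.

```python
from importlib import metadata

# Versions taken from the pip commands above.
PINS = {
    "keras_applications": "1.0.4",
    "keras_preprocessing": "1.0.2",
    "h5py": "2.8.0",
}


def check_pins(pins):
    """Return {package: (expected, installed)} for every package that deviates.

    installed is None when the package is not present in this environment.
    """
    mismatches = {}
    for package, expected in pins.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for package, (expected, installed) in check_pins(PINS).items():
        print(package, "expected", expected, "but found", installed)
```

An empty result means all three pins are satisfied in the active environment.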
conv2d now works like a charm when using the gpu!
However, since all this took a fairly long time (building from source takes over an hour, not counting the search for the solution on the internet), I recommend making a backup of the system after you get it working, e.g. using timeshift or any other program that you like.
Upvotes: 3
Reputation: 86
Building TensorFlow from source (r1.13) fixed the Conv2D segmentation fault.
Follow Build from Source.
My setup:
GPU: RTX 2070
OS: Ubuntu 16.04
Python: 3.5.2
Nvidia driver: 410.78
CUDA: 10.0.130
cuDNN: 7.4.2.24 (for CUDA 10.0)
TensorRT: 5.0.0
Compute capability: 7.5
Build: tensorflow-1.13.0rc0-cp35-cp35m-linux_x86_64
Download prebuilt from https://github.com/tensorflow/tensorflow/issues/22706
Upvotes: 7
Reputation: 556
I had the same Conv2D problem with:
Best advice was from this link: https://github.com/tensorflow/tensorflow/issues/24383
So a fix should come with TensorFlow 1.13. In the meantime, using the TensorFlow 1.13 nightly build (Dec 26, 2018) together with tensorflow.keras instead of keras solved the issue.
Upvotes: 2