Reputation: 11
I'm trying to run a Singularity/Nextflow script on an HPC. This script uses TensorFlow, which comes with the Docker image lpryszcz/deeplexicon:latest that I pulled to build the initial .sif file. However, whenever I try to import TensorFlow within my Nextflow pipeline like this:
python3 -c "import tensorflow as tf; print('TensorFlow version:', tf.__version__)" >> ${params.resultsDir}/cuda_paths.txt
I am presented with this error:
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
Failed to load the native TensorFlow runtime.
I initially attempted to fix this by determining where my libcuda.so file was:
ldconfig -p | grep libcuda
Which turned out to be:
libcudart.so.10.0 (libc6,x86-64) => /usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudart.so.10.0
And then simply adding that libcudart.so.10.0 path directly to LD_LIBRARY_PATH:
LD_LIBRARY_PATH=/usr/local/cuda-10.0/targets/x86_64-linux/lib:$LD_LIBRARY_PATH
However, the issue still persists.
For more context, the input files for my job are as follows:
The nextflow pipeline script looks like this:
#!/usr/bin/env nextflow

process demultiplex {
    input:
    path fast5, from: params.fast5

    output:

    script:
    if(params.demultiplex)
        """
        mkdir -p ${params.resultsDir}
        python3 -c "import tensorflow as tf; print('TensorFlow version:', tf.__version__)" >> ${params.resultsDir}/cuda_paths.txt
        python3 /deeplexicon/deeplexicon.py dmux -p ${fast5} -f multi -m models/resnet20-final.h5 > ${params.resultsDir}/output.tsv
        """
    else
        """
        echo "Skipped"
        """
}
Which always throws the error when trying to import tensorflow.
And the config file looks like this (with some changes made for anonymity):
params {
    // Path to the sample description file
    fast5 = "/some_path/Deeplexicon/RNA004/fast5_pass"
    resultsDir = "/some_path/Deeplexicon/7_16_2"
    demultiplex = true
}

singularity {
    enabled = true
    autoMounts = false
    cacheDir = '/some_path/work/singularity_cache'
}

tower {
    enabled = false
    endpoint = '-'
    accessToken = 'nextflowTowerToken'
}

process {
    cpus = 1
    executor = 'slurm'
    queue = 'pascal_gpu'
    perJobMemLimit = true
    containerOptions = '--bind (All of the CUDA paths from module show CUDA/10.1.243)'

    withName:demultiplex {
        container = 'deeplexicon_latest.sif'
        clusterOptions = '--gres=gpu:1'
        memory = { params.demultiplex ? 8.GB + (2.GB * (task.attempt-1)) : 2.GB }
        errorStrategy = { task.exitStatus == 130 ? 'retry' : 'terminate' }
        maxRetries = 3
    }
}
Finally, my slurm job submission script looks like this:
#!/bin/bash
#SBATCH --job-name=deeplexicon_RNA002
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --mem=2G
#SBATCH --time=1:00:00
#SBATCH --partition=pascal_gpu
#SBATCH --gres=gpu:1
module purge
module load legacy-software
module load CUDA/10.0.130
module load Java/11.0.20
export APPTAINER_TMPDIR=$SCRATCH_VO_USER/apptainer-tmp
export APPTAINER_CACHEDIR=$SCRATCH_USER/apptainer-cache
export NXF_SINGULARITY_CACHEDIR=/some_path/work/singularity_cache
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/targets/x86_64-linux/lib:$LD_LIBRARY_PATH
/some_path/tools/nextflow/nextflow-22.10.0-all -c /some_path/Deeplexicon/nextflow_scripts/deeplexicon.conf run /some_path/Deeplexicon/nextflow_scripts/deeplexicon.nf
I'm really not sure why importing TensorFlow keeps failing with this libcuda.so.1 error.
Upvotes: 1
Views: 54
Reputation: 54562
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
One way might be to append --nv to your singularity.runOptions to enable Nvidia support, for example:
singularity {
enabled = true
runOptions = '--nv'
}
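To confirm that --nv actually exposes the driver library (independently of Nextflow), a quick manual check on one of the GPU nodes could look like the command below; the .sif name is taken from the config in the question, so adjust the path as needed:
singularity exec --nv deeplexicon_latest.sif python3 -c "import tensorflow as tf; print(tf.__version__)"
If this prints the version without the libcuda.so.1 error, the same should work once Nextflow passes --nv to the singularity command it runs.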
If that fails, you might like to try binding your system's libcuda.so.1 to somewhere in the container's $LD_LIBRARY_PATH. You can use the singularity.runOptions (or docker.runOptions if using Docker) for this. For example:
singularity {
enabled = true
runOptions = '--bind /usr/lib/libcuda.so.1:/usr/local/nvidia/lib/libcuda.so.1'
}
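Note that the host-side path to libcuda.so.1 varies between systems, and since the library ships with the NVIDIA driver it may only be present on the GPU compute nodes (which could explain why the ldconfig search in the question only turned up libcudart). The /usr/lib path in the bind option above is just an example; to locate the library on a node that has the driver installed, something like this should work:
ldconfig -p | grep 'libcuda.so.1'
# or, if it is not in the linker cache:
find /usr -name 'libcuda.so.1' 2>/dev/null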
Tested using Docker:
$ cat main.nf
process demultiplex {
debug true
container 'lpryszcz/deeplexicon:latest'
script:
"""
python3 -c "import tensorflow as tf; print('TensorFlow version:', tf.__version__)"
"""
}
workflow {
demultiplex()
}
$ cat nextflow.config
docker {
enabled = true
runOptions = '-v /usr/lib/libcuda.so.1:/usr/local/nvidia/lib/libcuda.so.1'
}
Results:
$ nextflow run main.nf
N E X T F L O W ~ version 24.04.3
Launching `main.nf` [astonishing_swirles] DSL2 - revision: 4d3ac3440f
executor > local (1)
[4e/67997d] process > demultiplex [100%] 1 of 1 ✔
TensorFlow version: 1.13.1
Upvotes: 0