Brendan

Reputation: 11

Singularity/Nextflow script won't load TensorFlow from .sif on cluster

I'm trying to run a Singularity/Nextflow script on an HPC cluster. The script uses TensorFlow, which is provided by the Docker image lpryszcz/deeplexicon:latest that I pulled the initial .sif file from. However, whenever I try to import TensorFlow within my Nextflow pipeline like this:

 python3 -c "import tensorflow as tf; print('TensorFlow version:', tf.__version__)" >> ${params.resultsDir}/cuda_paths.txt

I am presented with this error:

  ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
  
  
  Failed to load the native TensorFlow runtime.

I initially attempted to fix this by determining where my libcuda.so file was:

    ldconfig -p | grep libcuda

which returned:

    libcudart.so.10.0 (libc6,x86-64) => /usr/local/cuda-10.0/targets/x86_64-linux/lib/libcudart.so.10.0

I then tried simply adding that directory to LD_LIBRARY_PATH:

    LD_LIBRARY_PATH=/usr/local/cuda-10.0/targets/x86_64-linux/lib:$LD_LIBRARY_PATH

However, the issue still persists.
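For what it's worth, my understanding is that libcudart.so.10.0 comes from the CUDA toolkit module, whereas the libcuda.so.1 that TensorFlow is asking for is the driver-side library, and I haven't confirmed where (or whether) that file exists on the compute nodes. The kind of check I assume is needed would look something like this on a GPU node (the paths are only guesses for this cluster):

# Guessed locations: libcuda.so.1 ships with the NVIDIA driver,
# not with the CUDA toolkit, so it won't be under /usr/local/cuda-10.0
ldconfig -p | grep 'libcuda\.so'
find /usr/lib64 /usr/lib /usr/lib/x86_64-linux-gnu -name 'libcuda.so.1' 2>/dev/null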

For more context, the scripts and config for my job are as follows:

The nextflow pipeline script looks like this:

#!/usr/bin/env nextflow 
process demultiplex {
    input:
    
    path fast5, from: params.fast5

    output:

    script:
    if(params.demultiplex)
    """
    mkdir -p ${params.resultsDir}

    python3 -c "import tensorflow as tf; print('TensorFlow version:', tf.__version__)" >> ${params.resultsDir}/cuda_paths.txt

    python3 /deeplexicon/deeplexicon.py dmux -p ${fast5} -f multi -m models/resnet20-final.h5 > ${params.resultsDir}/output.tsv
    """
    else
    """
        echo "Skipped"
    """
}

This always throws the error when trying to import TensorFlow.

And the config file looks like this (with some changes made for anonymity):

params {
    // Path to the fast5 input directory
    fast5 = "/some_path/Deeplexicon/RNA004/fast5_pass"
    resultsDir = "/some_path/Deeplexicon/7_16_2"
    demultiplex = true
}

singularity {
    enabled = true
    autoMounts = false
    cacheDir = '/some_path/work/singularity_cache'

}

tower {
    enabled = false
    endpoint = '-'
    accessToken = 'nextflowTowerToken'
}

process {
    cpus = 1
    executor = 'slurm'
    queue = 'pascal_gpu'
    perJobMemLimit = true

    containerOptions = '--bind (All of the CUDA paths from module show CUDA/10.1.243)'

    withName: demultiplex {
        container = 'deeplexicon_latest.sif'
        clusterOptions = '--gres=gpu:1'
        memory = { params.demultiplex ? 8.GB + (2.GB * (task.attempt-1)) : 2.GB }
        errorStrategy = { task.exitStatus == 130 ? 'retry' : 'terminate' }
        maxRetries = 3
    }
}

Finally, my slurm job submission script looks like this:

#!/bin/bash
#SBATCH --job-name=deeplexicon_RNA002
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --mem=2G
#SBATCH --time=1:00:00
#SBATCH --partition=pascal_gpu
#SBATCH --gres=gpu:1


module purge
module load legacy-software
module load CUDA/10.0.130
module load Java/11.0.20


export APPTAINER_TMPDIR=$SCRATCH_VO_USER/apptainer-tmp
export APPTAINER_CACHEDIR=$SCRATCH_USER/apptainer-cache

export NXF_SINGULARITY_CACHEDIR=/some_path/work/singularity_cache

export LD_LIBRARY_PATH=/usr/local/cuda-10.0/targets/x86_64-linux/lib:$LD_LIBRARY_PATH


/some_path/tools/nextflow/nextflow-22.10.0-all -c /some_path/Deeplexicon/nextflow_scripts/deeplexicon.conf run /some_path/Deeplexicon/nextflow_scripts/deeplexicon.nf
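If it helps, one debugging idea (not currently in the pipeline, and assuming ldconfig exists inside the image) would be to dump what the container actually sees into the same cuda_paths.txt, by adding something like this to the demultiplex script block:

# Debugging only: record the container-side library view
echo "LD_LIBRARY_PATH inside container: \$LD_LIBRARY_PATH" >> ${params.resultsDir}/cuda_paths.txt
ldconfig -p | grep libcuda >> ${params.resultsDir}/cuda_paths.txt || true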

I'm really not sure why TensorFlow fails to load when imported inside the container.

Upvotes: 1

Views: 54

Answers (1)

Steve

Reputation: 54562

ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

One way might be to add --nv to your singularity.runOptions to enable NVIDIA GPU support, for example:

singularity {
  enabled = true
  runOptions = '--nv'
}
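To test this outside of Nextflow first, you could run the same import with a one-off command; the .sif path below is just a guess based on the cacheDir in your config, so adjust it to wherever the image actually sits:

singularity exec --nv /some_path/work/singularity_cache/deeplexicon_latest.sif \
  python3 -c "import tensorflow as tf; print('TensorFlow version:', tf.__version__)"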

If that fails, you might like to try binding your system's libcuda.so.1 to somewhere in the container's $LD_LIBRARY_PATH. You can use the singularity.runOptions (or docker.runOptions if using Docker) for this. For example:

singularity {
  enabled = true
  runOptions = '--bind /usr/lib/libcuda.so.1:/usr/local/nvidia/lib/libcuda.so.1'
}
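The source path /usr/lib/libcuda.so.1 above is only illustrative; on your cluster the driver library may live somewhere else entirely. If you'd rather keep the bind per-process (alongside what you already pass via containerOptions), the same option should also work there, for example:

process {
  withName: demultiplex {
    containerOptions = '--bind /usr/lib/libcuda.so.1:/usr/local/nvidia/lib/libcuda.so.1'
  }
}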

Tested using Docker:

$ cat main.nf
process demultiplex {

  debug true

  container 'lpryszcz/deeplexicon:latest'

  script:
  """
  python3 -c "import tensorflow as tf; print('TensorFlow version:', tf.__version__)" 
  """
}

workflow {

  demultiplex()
}
$ cat nextflow.config 
docker {
  enabled = true
  runOptions = '-v /usr/lib/libcuda.so.1:/usr/local/nvidia/lib/libcuda.so.1'
}

Results:

$ nextflow run main.nf 

 N E X T F L O W   ~  version 24.04.3

Launching `main.nf` [astonishing_swirles] DSL2 - revision: 4d3ac3440f

executor >  local (1)
[4e/67997d] process > demultiplex [100%] 1 of 1 ✔
TensorFlow version: 1.13.1

Upvotes: 0
