xwt1

Reputation: 15

I can't find /usr/local/cuda-<x>.<y>/gds/samples after installing the CUDA Toolkit and driver

I want to use GPUDirect Storage (GDS). I followed the instructions at https://docs.nvidia.com/gpudirect-storage/troubleshooting-guide/index.html#mofed-req-install to install it. The installation details are as follows:

  1. First, I installed the CUDA Toolkit and driver from here: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_network

  2. Then, according to https://developer.nvidia.com/gpudirect-storage, GDS should now be part of CUDA. But when I looked in the gds directory to check that it was installed correctly, I found:

    (base) no@ho-4:/usr/local/cuda-12.6/gds$ ls -l
    total 32
    -rw-r--r-- 1 root root 10756  Aug 22 13:36 cufile.json
    -rw-r--r-- 1 root root 14290  Aug 22 13:36 README
    drwxr-xr-x 2 root root  4096  Sep 23 15:59 tools
    

    The expected layout shown in the NVIDIA docs is:

    $ ls -lh /usr/local/cuda-X.Y/gds/
    total 20K
    -rw-r--r-- 1 root root 8.4K Mar 15 13:01 README
    drwxr-xr-x 2 root root 4.0K Mar 19 12:29 samples
    drwxr-xr-x 2 root root 4.0K mar 19 10:28 tools
    

My installation has no samples directory, and it contains a file I did not expect, cufile.json. The samples directory should contain example programs for testing GDS functionality, so it is frustrating not to have it. Could someone please help me get the samples directory back?

By the way, I installed MLNX_OFED because I need NVMe support. After installing MLNX_OFED following the instructions at https://docs.nvidia.com/gpudirect-storage/troubleshooting-guide/index.html#mofed-req-install, I found that:

    (base) no@ho-4::/usr/local/cuda-12.6/gds/tools$ ./gdscheck.py -v
    warn: error opening log file: Permission denied, logging will be disabled
     GDS release version: 1.11.1.6
     nvidia_fs not loaded, operating on compatible mode. libcufile version: 2.12
     Platform: x86_64

The nvidia_fs module is not loaded. I installed it using:

    sudo apt install nvidia-fs

But the version this installs does not seem to be a recent one that matches my setup. Could someone also tell me how to get the matching newer version of nvidia_fs?

Upvotes: 1

Views: 70

Answers (1)

EricM K

Reputation: 1

Unfortunately, getting GDS to work is not a simple task, primarily because of the fragmented documentation, I guess.

Regarding the first part, about the samples folder: the samples are installed by the gds-tools apt package. That package also contains the gdscheck and gdsio tools, so I'm very surprised that you have gdscheck but no samples. Maybe try reinstalling it.
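Before reinstalling, it may help to see what the installed package actually shipped. A minimal sketch, assuming the package is named gds-tools-12-6 on your system (the versioned name is my guess from the CUDA 12.6 path in the question; `dpkg -l | grep gds` shows the real one). The helper names are hypothetical:

```shell
# Keep only the paths that sit under a gds/samples directory (reads paths on stdin).
# "|| true" keeps the function from failing when nothing matches.
samples_paths() {
    grep '/gds/samples' || true
}

# List every file installed by one package and keep the sample paths.
# Usage (package name is an assumption): package_samples gds-tools-12-6
package_samples() {
    dpkg -L "$1" | samples_paths
}
```

If `package_samples` prints nothing, the installed package version did not put a samples directory on disk, and a reinstall of the same version probably won't change that.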

For the GDS installation, I'd recommend having a look at the DGX OS 5 user guide. It gives quite detailed instructions on how to do things, even if you are on plain Ubuntu and not DGX OS.

Since gds version 12.2.1-1 you need the open kernel driver instead of the proprietary one. The DGX OS user manual describes what you have to do if you want to install the last compatible version that relies on the proprietary driver.

In the line:

    sudo apt install nvidia-gds-12-2=12.2.1-1 nvidia-dkms-${NVIDIA_DRV_VERSION}-server

my recommendation is to append --dry-run, which lets you check what would be installed and removed. This command might uninstall your current NVIDIA driver, for example if you don't have the server variant. That is especially dangerous with more obscure drivers, for example a vSphere GRID driver in a VM.
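The dry-run output can be long, so here is a rough sketch that filters it down to the packages that would be removed. It relies on apt-get's simulation output marking removals with lines that start with `Remv`. The function names are hypothetical, and `NVIDIA_DRV_VERSION` is assumed to already hold your driver branch:

```shell
# Read simulated apt-get output on stdin and print the packages it would remove.
# In simulation mode apt-get prefixes each removal with "Remv".
list_removals() {
    awk '$1 == "Remv" { print $2 }'
}

# Simulate an install (nothing is changed on the system) and show only the removals.
preview_removals() {
    apt-get install --dry-run "$@" | list_removals
}

# Example call with the package names from the command above:
# preview_removals nvidia-gds-12-2=12.2.1-1 "nvidia-dkms-${NVIDIA_DRV_VERSION}-server"
```

If your current driver package shows up in that list, stop and sort out the driver variant first.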

Further, once you have correctly installed the nvidia-fs driver and have libcufile.so on your system, gdscheck -p should work. You also have to make sure that your filesystem supports GDS. For ext4, for example, this means explicitly setting the journaling mode (see here).
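To check the ext4 part, you can look at the mount options of the filesystem you want to use. A minimal sketch: as far as I recall from the GDS documentation, the mode it expects is data=ordered, but treat that as an assumption and confirm it in the docs. The function names and the /mnt/nvme path are hypothetical:

```shell
# Print the filesystem type and mount options for the filesystem that holds a path.
# Usage: fs_info /mnt/nvme
fs_info() {
    findmnt -no FSTYPE,OPTIONS -T "$1"
}

# Succeed if a comma-separated mount-option string contains data=ordered.
# Assumption: data=ordered is the journaling mode GDS expects for ext4.
has_ordered_journal() {
    case ",$1," in
        *,data=ordered,*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
# opts=$(findmnt -no OPTIONS -T /mnt/nvme)
# has_ordered_journal "$opts" && echo "data=ordered is set" || echo "data=ordered not listed"
```

If the option is not listed, remounting with it set explicitly (or adding it to /etc/fstab) is the usual way to change it.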

Upvotes: 0
