I am setting up NVIDIA GPUDirect Storage (GDS) with the following hardware and software configuration:
Ubuntu 22.04
CUDA 12.1
Nvidia Drivers 530.30.2
MLNX driver - 5.8.0
NVIDIA GeForce RTX 3090
Samsung 980 DC NVMe drive.
IOMMU is disabled
PCIe BAR has been resized to match the VRAM size
GDS was installed and verified successfully
GDS release version: 1.6.1.9
nvidia_fs version: 2.15 libcufile version: 2.12
Platform: x86_64
============
ENVIRONMENT:
============
=====================
DRIVER CONFIGURATION:
=====================
NVMe : Supported
NVMeOF : Unsupported
SCSI : Unsupported
ScaleFlux CSD : Unsupported
NVMesh : Unsupported
DDN EXAScaler : Unsupported
IBM Spectrum Scale : Unsupported
NFS : Unsupported
BeeGFS : Unsupported
WekaFS : Unsupported
Userspace RDMA : Unsupported
--Mellanox PeerDirect : Disabled
--rdma library : Not Loaded (libcufile_rdma.so)
--rdma devices : Not configured
--rdma_device_status : Up: 0 Down: 0
=====================
CUFILE CONFIGURATION:
=====================
properties.use_compat_mode : true
properties.force_compat_mode : false
properties.gds_rdma_write_support : true
properties.use_poll_mode : false
properties.poll_mode_max_size_kb : 4
properties.max_batch_io_size : 128
properties.max_batch_io_timeout_msecs : 5
properties.max_direct_io_size_kb : 1024
properties.max_device_cache_size_kb : 131072
properties.max_device_pinned_mem_size_kb : 18014398509481980
properties.posix_pool_slab_size_kb : 4 1024 16384
properties.posix_pool_slab_count : 128 64 32
properties.rdma_peer_affinity_policy : RoundRobin
properties.rdma_dynamic_routing : 0
fs.generic.posix_unaligned_writes : false
fs.lustre.posix_gds_min_kb: 0
fs.beegfs.posix_gds_min_kb: 0
fs.weka.rdma_write_support: false
fs.gpfs.gds_write_support: false
profile.nvtx : false
profile.cufile_stats : 0
miscellaneous.api_check_aggressive : false
execution.max_io_threads : 0
execution.max_io_queue_depth : 128
execution.parallel_io : false
execution.min_io_threshold_size_kb : 1024
execution.max_request_parallelism : 0
=========
GPU INFO:
=========
GPU index 0 NVIDIA GeForce RTX 3090 bar:1 bar size (MiB):32768, IOMMU State: Disabled
==============
PLATFORM INFO:
==============
Found ACS enabled for switch 0000:00:02.1
IOMMU: disabled
Platform verification succeeded
But when I ran NVIDIA's test benchmarks, they failed as shown below:
./gdsio_verify -f /media/nvme/write-test -d 0 -n 1 -s 1G
warn: error opening log file: Permission denied, logging will be disabled
gpu index :0,file :/media/nvme/write-test, gpu buffer alignment :0, gpu buffer offset :0, gpu devptr offset :0, file offset :0, io_requested :1073741824, io_chunk_size :1073741824, bufregister :true, sync :1, nr ios :1,
fsync :0,
Batch mode: 0
cuFileRead returned error(ret=-1, step_size=1073741824, bytes_left=1073741824)
buffer deregister failed :device pointer lookup failure
Checking the dmesg logs, I found:
nvidia-fs:nvfs_pin_gpu_pages:1292 Error ret -22 invoking nvidia_p2p_get_pages
va_start=0x7f6792900000/va_end=0x7f67929fffff/rounded_size=0x100000/gpu_buf_length=0x100000
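To rule out a malformed buffer range, the numbers in the dmesg lines can be checked by hand. Below is a minimal sketch that decodes the return code and verifies the address range. The 64 KiB granularity is an assumption taken from the GPUDirect RDMA documentation for nvidia_p2p_get_pages, not something the log states, and the variable names are only illustrative.

```python
import errno
import os

# Values copied from the dmesg lines above.
ret = -22
va_start = 0x7F6792900000
va_end = 0x7F67929FFFFF
rounded_size = 0x100000

# Assumption: nvidia_p2p_get_pages expects 64 KiB-aligned address and length,
# as described in the GPUDirect RDMA documentation.
GPU_PAGE_SIZE = 64 * 1024

# Kernel functions return negative errno values, so negate before the lookup.
errno_name = errno.errorcode[-ret]

# va_end is inclusive, so the span is end - start + 1.
span = va_end - va_start + 1
start_aligned = va_start % GPU_PAGE_SIZE == 0
size_aligned = rounded_size % GPU_PAGE_SIZE == 0

print(f"ret {ret} -> {errno_name} ({os.strerror(-ret)})")
print(f"range spans {span:#x} bytes, matches rounded_size: {span == rounded_size}")
print(f"start aligned to 64 KiB: {start_aligned}, size aligned: {size_aligned}")
```

The return code decodes to EINVAL, and the range is a consistent, 64 KiB-aligned 1 MiB region. That suggests, though it does not prove, that the call is being rejected for a reason other than buffer alignment.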
From some articles, I found that GPUDirect RDMA is supported only on Tesla/Quadro-class GPUs. I am curious what prevents the RTX 3090 from supporting it: is something missing in the hardware, or is it restricted by a driver module?
I have been digging through the code for nvidia-fs-2.51.3 to trace the issue. Any help would be appreciated, as I want to find the exact cause and, if possible, work around it with some code changes.
Also, there aren't many posts on the NVIDIA forum or GitHub issues on this topic.