utkrisht patesaria

Reputation: 11

nvidia_p2p_get_pages() failing with error code -22

I am setting up NVIDIA GPUDirect Storage (GDS) with the following hardware and software configuration:

Ubuntu 22.04
CUDA 12.1
Nvidia Drivers 530.30.2
MLNX driver - 5.8.0
NVIDIA GeForce RTX 3090
Samsung 980 DC NVMe drive.
IOMMU is disabled
PCIe BAR has been resized to match the VRAM size

GDS was installed and verified successfully:

 GDS release version: 1.6.1.9
 nvidia_fs version:  2.15 libcufile version: 2.12
 Platform: x86_64
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe               : Supported
 NVMeOF             : Unsupported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Unsupported
 BeeGFS             : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Disabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_compat_mode : true
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 1024
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 18014398509481980
 properties.posix_pool_slab_size_kb : 4 1024 16384 
 properties.posix_pool_slab_count : 128 64 32 
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.beegfs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 fs.gpfs.gds_write_support: false
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 execution.max_io_threads : 0
 execution.max_io_queue_depth : 128
 execution.parallel_io : false
 execution.min_io_threshold_size_kb : 1024
 execution.max_request_parallelism : 0
 =========
 GPU INFO:
 =========
 GPU index 0 NVIDIA GeForce RTX 3090 bar:1 bar size (MiB):32768, IOMMU State: Disabled
 ==============
 PLATFORM INFO:
 ==============
 Found ACS enabled for switch 0000:00:02.1
 IOMMU: disabled
 Platform verification succeeded

But when I ran the bundled test benchmarks, they failed as shown below:

./gdsio_verify -f /media/nvme/write-test -d 0 -n 1 -s 1G 

warn: error opening log file: Permission denied, logging will be disabled
gpu index :0,file :/media/nvme/write-test, gpu buffer alignment :0, gpu buffer offset :0, gpu devptr offset :0, file offset :0, io_requested :1073741824, io_chunk_size :1073741824, bufregister :true, sync :1, nr ios :1, 
fsync :0, 
Batch mode: 0
cuFileRead returned error(ret=-1, step_size=1073741824, bytes_left=1073741824)
buffer deregister failed :device pointer lookup failure

Checking the dmesg logs, I found:

nvidia-fs:nvfs_pin_gpu_pages:1292 Error ret -22 invoking nvidia_p2p_get_pages
                va_start=0x7f6792900000/va_end=0x7f67929fffff/rounded_size=0x100000/gpu_buf_length=0x100000
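
For reference, a return value of -22 from a Linux kernel function is -EINVAL ("Invalid argument"). The sketch below decodes the code and sanity-checks the addresses from the dmesg line. The 64 KiB granularity is an assumption taken from the GPUDirect RDMA documentation for nvidia_p2p_get_pages(), and the helper name describe_kernel_error is hypothetical, introduced only for illustration.

```python
import errno
import os


def describe_kernel_error(ret: int) -> str:
    """Map a negative kernel return code (e.g. -22) to its errno name and message."""
    code = abs(ret)
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


# Values copied from the dmesg line above.
va_start = 0x7F6792900000
va_end = 0x7F67929FFFFF
gpu_buf_length = 0x100000

# Assumption: GPU pages are pinned at 64 KiB granularity, so both the start
# address and the length are expected to be 64 KiB aligned.
GPU_PAGE_SIZE = 64 * 1024

print(describe_kernel_error(-22))
print("length matches range:  ", va_end - va_start + 1 == gpu_buf_length)
print("start is 64 KiB aligned:", va_start % GPU_PAGE_SIZE == 0)
print("length is 64 KiB multiple:", gpu_buf_length % GPU_PAGE_SIZE == 0)
```

If all three checks pass, a misaligned buffer is unlikely to be the reason for the EINVAL, which leaves other rejected-argument causes inside the driver to investigate.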

After reading a few articles, I found that GPUDirect RDMA is supported only on Tesla/Quadro-class GPUs. I am curious what prevents the RTX 3090 from supporting it: is something missing in the hardware, or is it a restriction in a driver module?

I have been digging through the code for nvidia-fs-2.51.3 to trace the issue. Any help would be appreciated, as I want to find the exact cause and, if possible, work around it with some code changes.

Also, there aren't many articles on the NVIDIA forums or in GitHub issues on this topic.

Upvotes: 1

Views: 247

Answers (0)
