RobbieTheK

Reputation: 198

OpenMPI 4.1.1: "There was an error initializing an OpenFabrics device" with InfiniBand Mellanox MT28908

Similar to the discussion at "MPI hello_world to test infiniband", we are using OpenMPI 4.1.1 on RHEL 8 with a Mellanox ConnectX-6 adapter (5e:00.0 Infiniband controller [0207]: Mellanox Technologies MT28908 Family [ConnectX-6] [15b3:101b]). We see this warning with mpirun:

WARNING: There was an error initializing an OpenFabrics device.

  Local host:   xxxx
  Local device: mlx5_0
---------------------------------------------

Here are some verbose logs from running this STREAM benchmark:

mpirun -mca pml_base_verbose 100 -mca btl_base_verbose 100 -mca mca_base_verbose 100 -mca btl_openib_verbose true  -mca pml ucx --mca orte_base_help_aggregate 0 --mca opal_warn_on_missing_libcuda 0 -np 1 --oversubscribe ./stream_mpi --oversubscribe
[g183:270526] mca: base: components_register: registering framework btl components
[g183:270526] mca: base: components_register: found loaded component openib
[g183:270526] mca: base: components_register: component openib register function successful
[g183:270526] mca: base: components_register: found loaded component sm
[g183:270526] mca: base: components_register: found loaded component tcp
[g183:270526] mca: base: components_register: component tcp register function successful
[g183:270526] mca: base: components_register: found loaded component self
[g183:270526] mca: base: components_register: component self register function successful
[g183:270526] mca: base: components_register: found loaded component vader
[g183:270526] mca: base: components_register: component vader register function successful
[g183:270526] mca: base: components_register: found loaded component smcuda
[g183:270526] mca: base: components_register: component smcuda register function successful
[g183:270526] mca: base: components_open: opening btl components
[g183:270526] mca: base: components_open: found loaded component openib
[g183:270526] mca: base: components_open: component openib open function successful
[g183:270526] mca: base: components_open: found loaded component tcp
[g183:270526] mca: base: components_open: component tcp open function successful
[g183:270526] mca: base: components_open: found loaded component self
[g183:270526] mca: base: components_open: component self open function successful
[g183:270526] mca: base: components_open: found loaded component vader
[g183:270526] mca: base: components_open: component vader open function successful
[g183:270526] mca: base: components_open: found loaded component smcuda
[g183:270526] btl: smcuda: cuda_max_send_size=131072, max_send_size=32768, max_frag_size=131072
[g183:270526] mca: base: components_open: component smcuda open function successful
[g183:270526] select: initializing btl component openib
[g183:270526] Checking distance from this process to device=mlx5_0
[g183:270526] hwloc_distances->nbobjs=4
[g183:270526] hwloc_distances->values[0]=10
[g183:270526] hwloc_distances->values[1]=21
[g183:270526] hwloc_distances->values[2]=11
[g183:270526] hwloc_distances->values[3]=21
[g183:270526] ibv_obj->type set to NULL
[g183:270526] Process is bound: distance to device is 0.000000
[g183][[11854,1],0][btl_openib_ini.c:172:opal_btl_openib_ini_query] Querying INI files for vendor 0x02c9, part ID 4123
[g183][[11854,1],0][btl_openib_ini.c:188:opal_btl_openib_ini_query] Found corresponding INI values: Mellanox ConnectX6
[g183][[11854,1],0][btl_openib_ini.c:172:opal_btl_openib_ini_query] Querying INI files for vendor 0x0000, part ID 0
[g183][[11854,1],0][btl_openib_ini.c:188:opal_btl_openib_ini_query] Found corresponding INI values: default
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   xxxx
  Local device: mlx5_0
--------------------------------------------------------------------------
[g183:270526] select: init of component openib returned failure
[g183:270526] mca: base: close: component openib closed
[g183:270526] mca: base: close: unloading component openib
[g183:270526] select: initializing btl component tcp
[g183:270526] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[g183:270526] btl: tcp: Found match: 127.0.0.1 (lo)
[g183:270526] btl:tcp: Attempting to bind to AF_INET port 1024
[g183:270526] btl:tcp: Successfully bound to AF_INET port 1024
[g183:270526] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[g183:270526] btl:tcp: examining interface eth0
[g183:270526] btl:tcp: using ipv6 interface eth0
[g183:270526] btl:tcp: examining interface ib0
[g183:270526] btl:tcp: using ipv6 interface ib0
[g183:270526] select: init of component tcp returned success
[g183:270526] select: initializing btl component self
[g183:270526] select: init of component self returned success
[g183:270526] select: initializing btl component vader
[g183:270526] select: init of component vader returned failure
[g183:270526] mca: base: close: component vader closed
[g183:270526] mca: base: close: unloading component vader
[g183:270526] select: initializing btl component smcuda
[g183:270526] select: init of component smcuda returned failure
[g183:270526] mca: base: close: component smcuda closed
[g183:270526] mca: base: close: unloading component smcuda
[g183:270526] mca: base: components_register: registering framework pml components
[g183:270526] mca: base: components_register: found loaded component ucx
[g183:270526] mca: base: components_register: component ucx register function successful
[g183:270526] mca: base: components_open: opening pml components
[g183:270526] mca: base: components_open: found loaded component ucx
[g183:270526] mca: base: components_open: component ucx open function successful
[g183:270526] select: initializing pml component ucx
[g183:270526] select: init returned priority 51
[g183:270526] selected ucx best priority 51
[g183:270526] select: component ucx selected
-------------------------------------------------------------
STREAM version $Revision: 1.8 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Total Aggregate Array size = 105000000 (elements)
Total Aggregate Memory per array = 801.1 MiB (= 0.8 GiB).
Total Aggregate memory required = 2403.3 MiB (= 2.3 GiB).
Data is distributed across 1 MPI ranks
   Array size per MPI rank = 105000000 (elements)
   Memory per array per MPI rank = 801.1 MiB (= 0.8 GiB).
   Total memory per MPI rank = 2403.3 MiB (= 2.3 GiB).
-------------------------------------------------------------
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
The SCALAR value used for this run is 0.420000
-------------------------------------------------------------
Your timer granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 68148 microseconds.
   (= 68148 timer ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 timer ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:          13648.7     0.123378     0.123089     0.123952
Scale:         13784.7     0.122519     0.121874     0.123266
Add:           14363.8     0.175696     0.175440     0.175882
Triad:         14216.1     0.177668     0.177264     0.178539
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
[g183:270526] mca: base: close: component ucx closed
[g183:270526] mca: base: close: unloading component ucx
[g183:270526] mca: base: close: component tcp closed
[g183:270526] mca: base: close: unloading component tcp
[g183:270526] mca: base: close: component self closed
[g183:270526] mca: base: close: unloading component self

I added 0x02c9 to our mca-btl-openib-device-params.ini file for Mellanox ConnectX6 because we were getting:

WARNING: No preset parameters were found for the device that Open MPI detected:

  Local host:            xxxx
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4123

This vendor ID is referenced in these comments:

# Note: Several vendors resell Mellanox hardware and put their own firmware
# on the cards, therefore overriding the default Mellanox vendor ID.
#
#     Mellanox      0x02c9

Forcing ucx still generates the error:

mpirun  -mca pml ucx --mca orte_base_help_aggregate 0 --mca opal_warn_on_missing_libcuda 0 -np 1 --oversubscribe ./stream_mpi --oversubscribe
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   xxxx
  Local device: mlx5_0
--------------------------------------------------------------------------
-------------------------------------------------------------
STREAM version $Revision: 1.8 $

Is there a workaround for this? I tried --mca btl '^openib', which does suppress the warning, but doesn't that disable IB?

Upvotes: 2

Views: 4028

Answers (1)

bkelly1984

Reputation: 51

Hail Stack Overflow. I am far from an expert but wanted to leave something for the people that follow in my footsteps.

This warning is generated by openmpi/opal/mca/btl/openib/btl_openib.c or btl_openib_component.c. I believe this is the code for the openib BTL component, which Open MPI has supported for a long time (https://www.open-mpi.org/faq/?category=openfabrics#ib-components). You can enable debug output from this code by setting the environment variable OMPI_MCA_btl_base_verbose=100 and running your program.

In my case (openmpi-4.1.4 with ConnectX-6 on Rocky Linux 8.7), init_one_device() in btl_openib_component.c was called, device->allowed_btls ended up equal to 0, which skipped a large if statement, and because device->btls was also 0, execution fell through to the error label. This suggests to me that this is not so much an error as the openib BTL component complaining that it was unable to initialize any devices.
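As a minimal sketch of the debugging step above: Open MPI reads any MCA parameter from an environment variable named OMPI_MCA_<parameter>, so exporting the variable has the same effect as passing -mca btl_base_verbose 100 to mpirun. The ./stream_mpi binary is the one from the question, and the grep filter is only an illustrative way to cut down the output, not something Open MPI requires.

```shell
# Same effect as "-mca btl_base_verbose 100" on the mpirun command line.
export OMPI_MCA_btl_base_verbose=100

# ./stream_mpi is the benchmark from the question; substitute your own program.
if command -v mpirun >/dev/null 2>&1 && [ -x ./stream_mpi ]; then
    # Keep only the lines that mention the openib component.
    mpirun -np 1 ./stream_mpi 2>&1 | grep -i openib || true
else
    echo "mpirun or ./stream_mpi not found; nothing was run" >&2
fi
```

Unset the variable afterwards (unset OMPI_MCA_btl_base_verbose) if you do not want every later run in the same shell to be verbose.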

I do not believe this component is necessary. The link above says,

In the v4.0.x series, Mellanox InfiniBand devices default to the ucx PML. The use of InfiniBand over the openib BTL is officially deprecated in the v4.0.x series, and is scheduled to be removed in Open MPI v5.0.0.

So, to your second question: no, --mca btl '^openib' does not disable IB. It turns off the deprecated openib BTL, which is no longer the default for IB. The link above has a nice table describing all the frameworks in different versions of OpenMPI.
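One way to apply this consistently is a small wrapper that always selects the UCX PML and excludes only the openib BTL. This is a sketch, not part of Open MPI: the function name run_without_openib and the MPIRUN override are hypothetical, and the override exists only so the command line can be dry-run with echo on a machine without Open MPI.

```shell
# Hypothetical wrapper: use UCX for InfiniBand traffic and skip only the
# openib BTL. MPIRUN is an illustrative override, not an Open MPI variable.
run_without_openib() {
    "${MPIRUN:-mpirun}" --mca pml ucx --mca btl '^openib' "$@"
}

# Dry run in a subshell: print the arguments instead of launching anything.
(MPIRUN=echo; run_without_openib -np 1 ./stream_mpi)
# prints: --mca pml ucx --mca btl ^openib -np 1 ./stream_mpi

# Real run (uncomment on a node with Open MPI in PATH):
# run_without_openib -np 1 ./stream_mpi
```

The same selection can be made without a wrapper by exporting OMPI_MCA_pml=ucx and OMPI_MCA_btl='^openib' in the job environment.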

The better solution is to compile OpenMPI without openib BTL support. The openib BTL is used for verbs-based communication, so the recommendations to configure OpenMPI with the --without-verbs flag are correct. However, in my case, running make clean followed by configure … --without-verbs and make did not clear out all of my previous build, and the result still gave me the warning. I was only able to eliminate it after deleting the previous install and building from a fresh download.
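The clean rebuild could look roughly like the sketch below. The install prefix, tarball name, and source directory are hypothetical placeholders, and --with-ucx is my assumption (the steps above only mention --without-verbs). The OMPI_INFO override is also illustrative; it just lets the check be exercised without a real install.

```shell
# Hypothetical locations; adjust to your site.
PREFIX="${PREFIX:-$HOME/opt/openmpi-4.1.4}"
TARBALL="${TARBALL:-openmpi-4.1.4.tar.bz2}"
SRCDIR="${SRCDIR:-openmpi-4.1.4}"

# Runs in a subshell so the cd does not leak into the calling shell.
rebuild_without_verbs() (
    rm -rf "${PREFIX:?}" "${SRCDIR:?}" &&   # drop the old install and old tree
    tar xjf "$TARBALL" &&                   # unpack a fresh source tree
    cd "$SRCDIR" &&
    ./configure --prefix="$PREFIX" --without-verbs --with-ucx &&
    make &&
    make install
)

# Succeeds if ompi_info still lists the openib BTL component.
openib_present() {
    "${OMPI_INFO:-ompi_info}" | grep -qi 'MCA btl: openib'
}
```

After running rebuild_without_verbs and putting the new install first in PATH, openib_present should return a non-zero status; if it still succeeds, an older install is probably being picked up.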

Upvotes: 2
