Puneet Durve


Getting "ud_ep.c:278 Fatal: UD endpoint 0x22fe520 to <no debug data>: unhandled timeout error" while trying to run OSU microbenchmarks using OpenMPI & UCX

I have a couple of servers with some NICs in them, and I have installed ompi, ucx and osu-microbenchmarks on them. I am running the following command:

mpirun --mca pml ucx --mca osc ucx --mca spml ucx --mca btl ^vader,tcp,openib,uct -x UCX_NET_DEVICES=mlx5_1:1 -x UCX_TLS=self,sm,rc_v -x UCX_IB_GID_INDEX=3 -hostfile hosts /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency

This runs fine on my other NIC setup, but on my current setup it gives me the following error:

        ud_ep.c:278  Fatal: UD endpoint 0x22fe520 to <no debug data>: unhandled timeout error
        ==== backtrace (tid:   4061) ====
         0  /root/ucx/ucx_install/lib/libucs.so.0(ucs_handle_error+0x294) [0x7fdb4153bda4]
         1  /root/ucx/ucx_install/lib/libucs.so.0(ucs_fatal_error_message+0xb2) [0x7fdb41539162]
         2  /root/ucx/ucx_install/lib/libucs.so.0(+0x2a239) [0x7fdb41539239]
         3  /root/ucx/ucx_install/lib/ucx/libuct_ib.so.0(+0x5e050) [0x7fdb41493050]
         4  /root/ucx/ucx_install/lib/libucs.so.0(+0x21467) [0x7fdb41530467]
         5  /root/ucx/ucx_install/lib/libucp.so.0(ucp_worker_progress+0x2a) [0x7fdb415ee91a]
         6  /root/ompi/ompi_install/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0x157) [0x7fdb41683df7]
         7  /root/ompi/ompi_install/lib/libmpi.so.40(ompi_coll_base_barrier_intra_recursivedoubling+0xbb) [0x7fdb438d17fb]
         8  /root/ompi/ompi_install/lib/libmpi.so.40(MPI_Barrier+0xa8) [0x7fdb43882ea8]
         9  /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency() [0x4027fe]
        10  /lib64/libc.so.6(+0x44e50) [0x7fdb4332de50]
        11  /lib64/libc.so.6(__libc_start_main+0x7c) [0x7fdb4332defc]
        12  /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency() [0x4031b5]
        =================================
        [tst-srv-193:04061] *** Process received signal ***
        [tst-srv-193:04061] Signal: Aborted (6)
        [tst-srv-193:04061] Signal code:  (-6)
        [tst-srv-193:04061] [ 0] /lib64/libc.so.6(+0x59db0)[0x7fdb43342db0]
        [tst-srv-193:04061] [ 1] /lib64/libc.so.6(+0xa642c)[0x7fdb4338f42c]
        [tst-srv-193:04061] [ 2] /lib64/libc.so.6(raise+0x16)[0x7fdb43342d06]
        [tst-srv-193:04061] [ 3] /lib64/libc.so.6(abort+0xd3)[0x7fdb433157d3]
        [tst-srv-193:04061] [ 4] /root/ucx/ucx_install/lib/libucs.so.0(+0x2a167)[0x7fdb41539167]
        [tst-srv-193:04061] [ 5] /root/ucx/ucx_install/lib/libucs.so.0(+0x2a239)[0x7fdb41539239]
        [tst-srv-193:04061] [ 6] /root/ucx/ucx_install/lib/ucx/libuct_ib.so.0(+0x5e050)[0x7fdb41493050]
        [tst-srv-193:04061] [ 7] /root/ucx/ucx_install/lib/libucs.so.0(+0x21467)[0x7fdb41530467]
        [tst-srv-193:04061] [ 8] /root/ucx/ucx_install/lib/libucp.so.0(ucp_worker_progress+0x2a)[0x7fdb415ee91a]
        [tst-srv-193:04061] [ 9] /root/ompi/ompi_install/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0x157)[0x7fdb41683df7]
        [tst-srv-193:04061] [10] /root/ompi/ompi_install/lib/libmpi.so.40(ompi_coll_base_barrier_intra_recursivedoubling+0xbb)[0x7fdb438d17fb]
        [tst-srv-193:04061] [11] /root/ompi/ompi_install/lib/libmpi.so.40(MPI_Barrier+0xa8)[0x7fdb43882ea8]
        [tst-srv-193:04061] [12] /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency[0x4027fe]
        [tst-srv-193:04061] [13] /lib64/libc.so.6(+0x44e50)[0x7fdb4332de50]
        [tst-srv-193:04061] [14] /lib64/libc.so.6(__libc_start_main+0x7c)[0x7fdb4332defc]
        [tst-srv-193:04061] [15] /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency[0x4031b5]
        [tst-srv-193:04061] *** End of error message ***
        [tst-srv-192:3952 :0:3952]       ud_ep.c:278  Fatal: UD endpoint 0x24344f0 to <no debug data>: unhandled timeout error
        ==== backtrace (tid:   3952) ====
         0  /root/ucx/ucx_install/lib/libucs.so.0(ucs_handle_error+0x294) [0x7f2892c6ada4]
         1  /root/ucx/ucx_install/lib/libucs.so.0(ucs_fatal_error_message+0xb2) [0x7f2892c68162]
         2  /root/ucx/ucx_install/lib/libucs.so.0(+0x2a239) [0x7f2892c68239]
         3  /root/ucx/ucx_install/lib/ucx/libuct_ib.so.0(+0x5e050) [0x7f2892bc2050]
         4  /root/ucx/ucx_install/lib/libucs.so.0(+0x21467) [0x7f2892c5f467]
         5  /root/ucx/ucx_install/lib/libucp.so.0(ucp_worker_progress+0x2a) [0x7f2892d1d91a]
         6  /root/ompi/ompi_install/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0x157) [0x7f2892db2df7]
         7  /root/ompi/ompi_install/lib/libmpi.so.40(ompi_coll_base_barrier_intra_recursivedoubling+0xbb) [0x7f2898ffe7fb]
         8  /root/ompi/ompi_install/lib/libmpi.so.40(MPI_Barrier+0xa8) [0x7f2898fafea8]
         9  /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency() [0x4027fe]
        10  /lib64/libc.so.6(+0x44e50) [0x7f2898a5ae50]
        11  /lib64/libc.so.6(__libc_start_main+0x7c) [0x7f2898a5aefc]
        12  /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency() [0x4031b5]
        =================================
        [tst-srv-192:03952] *** Process received signal ***
        [tst-srv-192:03952] Signal: Aborted (6)
        [tst-srv-192:03952] Signal code:  (-6)
        [tst-srv-192:03952] [ 0] /lib64/libc.so.6(+0x59db0)[0x7f2898a6fdb0]
        [tst-srv-192:03952] [ 1] /lib64/libc.so.6(+0xa642c)[0x7f2898abc42c]
        [tst-srv-192:03952] [ 2] /lib64/libc.so.6(raise+0x16)[0x7f2898a6fd06]
        [tst-srv-192:03952] [ 3] /lib64/libc.so.6(abort+0xd3)[0x7f2898a427d3]
        [tst-srv-192:03952] [ 4] /root/ucx/ucx_install/lib/libucs.so.0(+0x2a167)[0x7f2892c68167]
        [tst-srv-192:03952] [ 5] /root/ucx/ucx_install/lib/libucs.so.0(+0x2a239)[0x7f2892c68239]
        [tst-srv-192:03952] [ 6] /root/ucx/ucx_install/lib/ucx/libuct_ib.so.0(+0x5e050)[0x7f2892bc2050]
        [tst-srv-192:03952] [ 7] /root/ucx/ucx_install/lib/libucs.so.0(+0x21467)[0x7f2892c5f467]
        [tst-srv-192:03952] [ 8] /root/ucx/ucx_install/lib/libucp.so.0(ucp_worker_progress+0x2a)[0x7f2892d1d91a]
        [tst-srv-192:03952] [ 9] /root/ompi/ompi_install/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_send+0x157)[0x7f2892db2df7]
        [tst-srv-192:03952] [10] /root/ompi/ompi_install/lib/libmpi.so.40(ompi_coll_base_barrier_intra_recursivedoubling+0xbb)[0x7f2898ffe7fb]
        [tst-srv-192:03952] [11] /root/ompi/ompi_install/lib/libmpi.so.40(MPI_Barrier+0xa8)[0x7f2898fafea8]
        [tst-srv-192:03952] [12] /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency[0x4027fe]
        [tst-srv-192:03952] [13] /lib64/libc.so.6(+0x44e50)[0x7f2898a5ae50]
        [tst-srv-192:03952] [14] /lib64/libc.so.6(__libc_start_main+0x7c)[0x7f2898a5aefc]
        [tst-srv-192:03952] [15] /usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency[0x4031b5]
        [tst-srv-192:03952] *** End of error message ***

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

The installation of OpenMPI, UCX and the OSU microbenchmarks seems fine to me. I tried removing the UCX_NET_DEVICES setting so that UCX picks the first NIC automatically, but that didn't work either. I am looking for any pointers on how to go about debugging this.

