Reputation: 703
I have two shared libraries, libA and libB, which are used on Linux in two ways:
1. Directly linked as shared libs to an "offline" test executable.
2. Used in the real application: an auxiliary wrapper library (libWrapper) is linked against libA and libB, and the application opens only the wrapper lib by calling dlopen("libWrapper.so", RTLD_NOW | RTLD_LOCAL).
The problem: the libraries run complex image analysis algorithms, and sometimes the numeric results of the two setups are not equal. I need to make sure the test executable gives the same results as the real application, but I am not permitted to change the libraries or the real application, only the test executable.
I used LD_DEBUG=bindings to find differences in the output (to stderr):
$ grep acosf log-bindings.test-executable # *"offline" test executable*
binding file libB.so to libA.so: normal symbol `acosf.J'
binding file libB.so to libA.so: normal symbol `acosf.A'
binding file libA.so to libA.so: normal symbol `acosf.J'
binding file libA.so to libA.so: normal symbol `acosf.A'
binding file libB.so to libA.so: normal symbol `acosf' <<<<<<<
binding file libA.so to libA.so: normal symbol `acosf' <<<<<<<
$ grep acosf log-bindings.process # logging from *real process*
binding file libB.so to libA.so: normal symbol `acosf.J'
binding file libB.so to libA.so: normal symbol `acosf.A'
binding file libB.so to libB.so: normal symbol `_ZSt4acosf' # std::acosf
binding file libB.so to **libm**.so.6: normal symbol `acosf' <<<<<<
binding file libA.so to libA.so: normal symbol `acosf.J'
binding file libA.so to libA.so: normal symbol `acosf.A'
binding file libA.so to **libm**.so.6: normal symbol `acosf' <<<<<<
(paths removed for clarity)
This suggests that in the real application many math function symbols (cos, cosf, exp, expf, sin, sinf, acos, ...) are bound to the system math library libm, while in the test executable the same symbols are bound from libB to libA, and from libA to libA itself. This could be the reason for the differences.
Take the function acosf() as an example. Passing -Wl,-y,acosf to the compiler (linker option -y acosf) produces this output during the build:
release/libBdl/lib/libA.so: definition of acosf
release/libBdl/lib/libB.so: reference to acosf
I use the nm tool to show symbols in the libraries:
$ nm libA/libA.so | grep acosf
00665200 T acosf # impl. of acosf (text symbol)
0066c360 T acosf.A
0066c55c T acosf.J
00271fae t _Z13acosf_checkedf # acosf_checked(float)
00708244 r _Z13acosf_checkedf$$LSDA
$ nm libB/libB.so | grep acosf
01423780 T acosf # impl. of acosf (text symbol)
01424410 T acosf.A
0142460c T acosf.J
004c1b3a W _ZSt4acosf
01547eec r _ZSt4acosf$$LSDA
Although the math lib on the release computer has no symbols, I assume libm works the same way there: it defines weak symbols such as expf or acosf in the lib, which users should be able to override in their own lib with a strong symbol:
[newer CentOS7 system]$ nm /usr/lib/libm.so|grep acosf
0001b9c0 W acosf # weak symbol 'acosf'
0001b9c0 t __acosf # strong symbol / implementation
000176b0 T __acosf_finite
000176b0 t __ieee754_acosf # called by __acosf in libm
[newer CentOS7 system]$ nm /usr/lib/libm.so|grep expf
0001bc60 W expf # weak symbol 'expf'
0001bc60 t __expf # strong symbol / implementation
00017990 i __expf_finite
0002d370 t __expf_finite_ia32
0002d1b0 t __expf_finite_sse2
00017960 i __ieee754_expf # called by __expf in libm
0002d330 t __ieee754_expf_ia32
0002d1b0 t __ieee754_expf_sse2
Result of readelf -Ws .. | grep acosf for each file:
test-executable:
--
real-application:
--
libWrapper.so:
--
libB.so:
3934: 004c12a6 40 FUNC WEAK DEFAULT 10 _ZSt4acosf
5855: 01423b80 506 FUNC GLOBAL DEFAULT 10 acosf.A
10422: 01423d7c 666 FUNC GLOBAL DEFAULT 10 acosf.J
14338: 01422ef0 40 FUNC GLOBAL DEFAULT 10 acosf
libA.so:
2333: 0066c1e8 506 FUNC GLOBAL DEFAULT 10 acosf.A
4179: 0066c3e4 666 FUNC GLOBAL DEFAULT 10 acosf.J
5772: 00665088 40 FUNC GLOBAL DEFAULT 10 acosf
I think the problems with symbol bindings are the typical Unix System V problems described in https://en.wikipedia.org/wiki/Weak_symbol in section "Limitations". With dlopen() the dynamic linker prefers libm with its weak symbol, because it is already loaded, although a strong symbol is available in libA "later".
With LD_DEBUG=all:
test-executable:
symbol=expf; lookup in file=./test-executable.shared
symbol=expf; lookup in file=/lib/libdl.so.2
symbol=expf; lookup in file=/home/test/test/bin_NDEBUG/libA/libA.so
binding file libB.so to libA.so: normal symbol `expf' <<<<
symbol=acosf; lookup in file=./test-executable.shared
symbol=acosf; lookup in file=/lib/libdl.so.2
symbol=acosf; lookup in file=/home/test/test/bin_NDEBUG/libA/libA.so
binding file libA.so to libA.so: normal symbol `acosf' <<<<
real-application:
symbol=expf; lookup in file=real-application
symbol=expf; lookup in file=/home/test/lib/libX1.so
symbol=expf; lookup in file=/home/test/lib/libX2.so
symbol=expf; lookup in file=/home/test/lib/libX3.so
symbol=expf; lookup in file=/home/test/lib/libX4.so
symbol=expf; lookup in file=/lib/libdl.so.2
symbol=expf; lookup in file=/usr/lib/libstdc++.so.5
symbol=expf; lookup in file=/home/test/lib/libX5.so
symbol=expf; lookup in file=/lib/i686/libm.so.6
binding file libA.so to libm.so.6: normal symbol `expf' <<<<<<<
symbol=acosf; lookup in file=real-application
symbol=acosf; lookup in file=/home/test/lib/libX1.so
symbol=acosf; lookup in file=/home/test/lib/libX2.so
symbol=acosf; lookup in file=/home/test/lib/libX3.so
symbol=acosf; lookup in file=/home/test/lib/libX4.so
symbol=acosf; lookup in file=/lib/libdl.so.2
symbol=acosf; lookup in file=/usr/lib/libstdc++.so.5
symbol=acosf; lookup in file=/home/test/lib/libX5.so
symbol=acosf; lookup in file=/lib/i686/libm.so.6
binding file libA.so to libm.so.6: normal symbol `acosf' <<<<<<
The auxiliary lib "libWrapper" is linked to libA and libB but does not define the symbol acosf.
The platform is an old 32-bit Linux using kernel 2.4 and glibc 2.2.5 (yes, 2001!).
The libs A and B are built using the Intel icc compiler with options -O3 and NDEBUG. With DEBUG there does not seem to be a problem. The static / archive build gives slightly different results compared with the shared linking.
The test executable is linked directly to the shared libs libA and libB using g++ (or icc, it makes no difference). I tried hard to get the test executable to also bind the math symbols to libm, by use of LD_PRELOAD or various linker flags, but this did not change anything.
My hypothesis: the dlopen call in the real application comes much later, after the usual libraries (including libm) are loaded and the application has started. Symbols are preferred if they are already found in previously loaded libs, even though the symbol there is weak and a strong symbol is available in libA. Probably this is just the behaviour of the old Linux, but the Wikipedia article on weak symbols describes exactly such a weakness of the linker for Unix System V like systems in its section "Limitations".
I tried the following for the test executable:
1. linker option -Wl,--no-whole-archive
2. defining LD_BIND_NOW
3. defining LD_PRELOAD=libm.so
None of these had any effect on the symbol binding:
symbol=acosf; lookup in file=./test-executable.shared
symbol=acosf; lookup in file=/lib/i686/libm.so.6
symbol=acosf; lookup in file=/lib/libdl.so.2
symbol=acosf; lookup in file=libA.so
binding file libA.so to libA.so: normal symbol `acosf'
My question: why does the test executable, even with LD_PRELOAD, stick to the in-library implementations (of libA), while the real application, which uses dlopen, binds to the libm symbols? And how can I force the test executable to behave the same as the real application, i.e. use the libm symbols?
Regrettably several modern dlopen flags are not available, and the linker also lacks e.g. --exclude-symbols. The LD_DYNAMIC_WEAK environment variable is not available on the old Linux either. Probably the only solution is to rewrite the test executable to use dlopen, too.
Any ideas are appreciated.
Upvotes: 3
Views: 733
Reputation: 703
I think I can answer the question myself.
The dlopen call in the real application comes much later, after the usual libraries (including libm) are loaded and the application has started executing. Symbols are preferred if they are already found in previously loaded libs, even though the symbol there is weak and a strong symbol is available in libA (loaded via dlopen later in program execution). The Wikipedia article on weak symbols describes exactly such a weakness of the dynamic linker ld-linux.so for Unix System V like systems (in this case Linux) in its section "Limitations".
With LD_DEBUG=all you can see how the linker searches a symbol.
In this case, where the original application and the shared libs must not be changed (linker flags, how and which symbols are exported), the only remaining solution is to rewrite the test executable so that it also uses dlopen, as the real application does.
Upvotes: 0
Reputation: 213879
I am not permitted to change the libraries or the real application.
If you are not allowed to change anything, then you can't fix the problem.
I used LD_DEBUG=bindings to find differences, and found that ...
LD_DEBUG is the wrong tool for debugging this. Use GDB instead.
Set a breakpoint on e.g. cos, run the two binaries, and confirm that they are in fact executing different code. Once you know that cos in one of the cases resides in libA (I can't quite parse your description, but I think that's what you claim to have observed), figure out how it gets into libA (use linker flag -Wl,-y,cos to determine that).
Symbol visibility may play a part in why symbol resolution behaves differently. The exact command lines used to link prod-exe, test-exe, libA.so and libB.so may matter. Running readelf -Ws prod-exe test-exe libA.so libB.so | grep ' cos$' may also be illuminating.
Once you have all the info (and assuming you still can't understand what's happening), ask a new question with a more detailed record of your observations.
Upvotes: 0