Reputation: 91
I am running a Flink job in standalone deployment mode that uses the Java DJL library to load a PyTorch model. The model gets loaded successfully, and I am able to cancel the job through the Flink REST API. However, when I try to launch the Flink job again, it throws:
UnsatisfiedLinkError: <pytorch>.so already loaded in another classloader
It requires a restart of the standalone deployment to load the model again. Is it possible to close the process along with the job-cancel request so that I can load the model again without restarting?
Upvotes: 0
Views: 908
Reputation: 1
Thanks to Frank Liu's answer (link) and a few additional steps, I managed to solve this issue in my environment, or at least came up with a workaround. I can't explain all the steps in depth, but this works in my setup, which is also a local cluster in standalone mode.
Steps that helped me:
1. Set the DJL native helper property:
System.setProperty("ai.djl.pytorch.native_helper", "ai.djl.pytorch.jni.NativeHelper");
2. I use maven-shade-plugin to create fat jars, so I added the following section to the maven-shade-plugin configuration in my pom.xml:
<configuration>
    <artifactSet>
        <excludes>
            <exclude>ai.djl:*</exclude>
            <exclude>ai.djl.*:*</exclude>
        </excludes>
    </artifactSet>
</configuration>
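For orientation, this is how that section might sit inside the full plugin declaration in pom.xml (a sketch; the plugin version and execution setup are illustrative, only the <configuration> block above comes from my setup):

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>3.4.1</version> <!-- illustrative version -->
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <artifactSet>
                    <excludes>
                        <exclude>ai.djl:*</exclude>
                        <exclude>ai.djl.*:*</exclude>
                    </excludes>
                </artifactSet>
            </configuration>
        </execution>
    </executions>
</plugin>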
I did this to ensure that Flink uses NativeHelper and the other djl dependencies from its /lib directory, loaded by its AppClassLoader rather than by the FlinkUserCodeClassLoader.
3. Put the djl dependencies in the lib directory of your Flink installation. I worked on Ubuntu and used Maven, so I had to find the appropriate jar packages in ~/.m2/repository/ai/djl/ and its subdirectories. The jars I moved to the <some-path>/flink-1.19.1/lib directory in my case were:
pytorch-engine-0.9.0.jar
directory in my case were:pytorch-engine-0.9.0.jar
pytorch-jni-2.4.0-0.30.0.jar
pytorch-native-auto-1.7.0.jar
model-zoo-0.9.0.jar
api-0.9.0.jar
4. I also had to put com.sun.jna into the <some-path>/flink-1.19.1/lib directory, because it is a dependency of djl that was used. So I put this jar there:
jna-5.3.0.jar
You can find it by searching for the class com.sun.jna.Native in your dependencies in your IDE (IntelliJ IDEA in my case) and then obtaining its location on disk.
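If you prefer to resolve it through Maven rather than the IDE, the com.sun.jna.Native class ships in the net.java.dev.jna:jna artifact, so the corresponding dependency entry (version matching the jar above) would look like:

<dependency>
    <groupId>net.java.dev.jna</groupId>
    <artifactId>jna</artifactId>
    <version>5.3.0</version>
</dependency>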
Versions may differ in your case, and you may need fewer or more djl dependencies.
5. Set the class loading order to parent-first in <Flink installation location>/conf/config.yaml. You will find the setting in the file commented out; just change it to match this:
classloader:
  resolve:
    order: parent-first
Here you can read about class loading order in Flink - Flink docs.
Upvotes: 0
Reputation: 346
The native library can only be loaded once per JVM. In DJL, the PyTorch native library is loaded when the Engine class is initialized; if the native library has already been loaded in another classloader, the Engine class will fail to initialize.
One workaround is to load the native library in the system ClassLoader, which can be shared by child classloaders. DJL allows you to inject a NativeHelper class to load the native library; you need to make sure your NativeHelper is on the system classpath:
System.setProperty("ai.djl.pytorch.native_helper", "org.examples.MyNativeHelper");
You can find the test code for NativeHelper here. See this link for more detail.
In your MyNativeHelper class, you only need to add the following:
public static void load(String path) {
    System.load(path);
}
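For completeness, a minimal sketch of the whole class, assuming the org.examples.MyNativeHelper name used in the property above:

package org.examples;

public final class MyNativeHelper {

    // DJL looks this class up by the name given in the
    // ai.djl.pytorch.native_helper property and invokes this method
    // with the absolute path of the PyTorch native library.
    public static void load(String path) {
        // System.load() binds the library to this class's ClassLoader;
        // since the class is on the system classpath, that is the system
        // ClassLoader shared by all child classloaders.
        System.load(path);
    }
}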
At runtime, DJL will invoke your load(String path) method to load the native library in your ClassLoader.
Upvotes: 0