Vim

Reputation: 91

PyTorch DJL Java loading exception

I am running a Flink job in standalone deployment mode that uses DJL (Java) to load a PyTorch model. The model loads successfully, and I am able to cancel the job through the Flink REST API. However, when I try to launch the Flink job again, it throws:

UnsatisfiedLinkError: <pytorch>.so already loaded in another classloader

Loading again requires restarting the standalone deployment. Is it possible to close the native library along with the job-cancel request so that I can load the model again without restarting?

Upvotes: 0

Views: 908

Answers (2)

Michał Dudzisz

Reputation: 1

Thanks to Frank Liu's answer (link), with a few additional steps I managed to solve this issue in my environment, or at least found a workaround. I can't explain all the steps in depth, but this works in my setup, which is also a local cluster in standalone mode.

Steps that helped me:

  1. Put this line in code that will later be executed by a TaskManager (this is important), before calling any DJL PyTorch-related code:
    System.setProperty("ai.djl.pytorch.native_helper", "ai.djl.pytorch.jni.NativeHelper");
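For illustration, here is a minimal sketch (the operator class and its body are my own hypothetical example, not from the original setup) of setting the property in a RichMapFunction's open() method, so that it runs on the TaskManager before any inference code:

    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.configuration.Configuration;

    public class ScoringFunction extends RichMapFunction<String, String> {

        @Override
        public void open(Configuration parameters) {
            // Runs on the TaskManager; must execute before the first DJL PyTorch call.
            System.setProperty("ai.djl.pytorch.native_helper",
                    "ai.djl.pytorch.jni.NativeHelper");
            // ... load the DJL model here, after the property is set
        }

        @Override
        public String map(String value) {
            // ... run inference here
            return value;
        }
    }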
  2. Exclude ai.djl.* dependencies from your fat jar. I used Maven and maven-shade-plugin to create fat jars, so in my pom.xml I added the following section to the maven-shade-plugin configuration:
<configuration>
  <artifactSet>
    <excludes>
      <exclude>ai.djl:*</exclude>
      <exclude>ai.djl.*:*</exclude>
    </excludes>
  </artifactSet>
</configuration>

I did this to ensure that Flink loads NativeHelper and the other DJL dependencies from its /lib directory using its AppClassLoader, not the FlinkUserCodeClassLoader.
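To check that this worked, here is a small diagnostic sketch (my addition, not part of the original steps) that you can drop into job code; it prints which classloader serves the helper class:

    try {
        Class<?> helper = Class.forName("ai.djl.pytorch.jni.NativeHelper");
        // Expect the JVM's application classloader here,
        // not a FlinkUserCodeClassLoader, if the jar is picked up from /lib.
        System.out.println(helper.getClassLoader());
    } catch (ClassNotFoundException e) {
        // The jar is not on Flink's /lib classpath.
        e.printStackTrace();
    }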

  3. Put the DJL dependencies in the lib directory of your Flink installation. I worked on Ubuntu and used Maven, so I had to find the appropriate jars in ~/.m2/repository/ai/djl/ and its subdirectories. The jars I moved to the <some-path>/flink-1.19.1/lib directory in my case were:
  • pytorch-engine-0.9.0.jar
  • pytorch-jni-2.4.0-0.30.0.jar
  • pytorch-native-auto-1.7.0.jar
  • model-zoo-0.9.0.jar
  • api-0.9.0.jar

I also had to put com.sun.jna into the <some-path>/flink-1.19.1/lib directory, because it is a DJL dependency that was used. So I added the jar:

  • jna-5.3.0.jar

You can find it by searching for the class com.sun.jna.Native in your dependencies in your IDE (IntelliJ IDEA in my case) and then obtaining its location on disk.
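If you prefer code to the IDE route, this one-liner (my addition) prints the jar a class was loaded from:

    // Prints the location of the jar that provides com.sun.jna.Native.
    System.out.println(com.sun.jna.Native.class
            .getProtectionDomain().getCodeSource().getLocation());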

Versions may differ in your case, and you may need fewer or more DJL dependencies.

  4. Change Flink's classloading order in <Flink installation location>/conf/config.yaml. You will find the setting commented out in the file; just change it to match this:
  classloader:
    resolve:
      order: parent-first

You can read more about classloading order in Flink here: Flink docs.

  5. Voilà. I can't explain every step in depth, but it works now. I can run multiple jobs that load models one after another with a single TaskManager.

Upvotes: 0

Frank Liu

Reputation: 346

A native library can only be loaded once per JVM. In DJL, the PyTorch native library is loaded when the Engine class is initialized; if the native library has already been loaded in another classloader, the Engine class will fail to initialize.

One workaround is to load the native library in the system ClassLoader, which is shared by child classloaders. DJL allows you to inject a NativeHelper class to load the native library; you need to make sure your NativeHelper is on the system classpath:

System.setProperty("ai.djl.pytorch.native_helper", "org.examples.MyNativeHelper");

You can find the test code for NativeHelper here.

See this link for more detail.

In your MyNativeHelper class, you only need to add the following:

    public class MyNativeHelper {
        // DJL invokes this method reflectively with the native library path.
        public static void load(String path) {
            System.load(path);
        }
    }

At runtime, DJL will invoke your load(String path) function to load the native library in your ClassLoader.
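Putting it together, here is a hedged sketch of the order of operations (the MyNativeHelper class name matches the snippet above; the surrounding driver is my illustration):

    import ai.djl.engine.Engine;

    public class Bootstrap {
        public static void main(String[] args) {
            // Must be set before the Engine class initializes.
            System.setProperty("ai.djl.pytorch.native_helper",
                    "org.examples.MyNativeHelper");
            // Engine initialization triggers MyNativeHelper.load(path) via
            // reflection, loading the native library once in the system classloader.
            Engine engine = Engine.getEngine("PyTorch");
            System.out.println("Loaded engine: " + engine.getEngineName());
        }
    }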

Upvotes: 0
