Reputation: 35
I'm encountering performance issues when running Deeplearning4j in a multi-threaded environment. The system slows down or gets stuck during inference, despite my creating a separate model instance for each thread. Training is unaffected: I call fit() on the main model and output() on the copied models.
Expected Behavior: I expect the model to perform inference efficiently in a multi-threaded context.
Actual Behavior: After several inference operations, the system dramatically slows down or gets stuck.
Setup: I'm using Deeplearning4j version M2 on a Mac with an i5 3.3 GHz processor and 16 GB of RAM. The ND4J version is M2.1.
Current Approach:
I create a new ComputationGraph instance for each thread before performing inference. Here's a snippet of my code for cloning the model:
// Existing model
ComputationGraph model = ...;
// Cloning for each thread
ComputationGraph clone = model.clone();
// Example inference code in each thread
INDArray input = ...; // your input data
INDArray[] output = clone.output(input); // ComputationGraph.output(...) returns one INDArray per configured output
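For context, each worker thread is launched roughly like this (a minimal sketch of the approach described above; the executor, thread count, and random dummy input are illustrative, not my exact code):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

ExecutorService pool = Executors.newFixedThreadPool(4);
for (int t = 0; t < 4; t++) {
    // One independent copy of the graph per thread
    ComputationGraph clone = model.clone();
    pool.submit(() -> {
        // Dummy input matching InputType.convolutional(M+2, N, 1): [batch, channels, height, width]
        INDArray input = Nd4j.rand(new int[]{1, 1, M + 2, N});
        INDArray[] output = clone.output(input);
    });
}
pool.shutdown();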
Current model:
Map<Integer, Double> learningRateSchedule = new HashMap<>();
learningRateSchedule.put(0, 2e-5);
learningRateSchedule.put(833, 2e-6);
learningRateSchedule.put(1666, 2e-7);
ISchedule schedule = new MapSchedule(ScheduleType.ITERATION, learningRateSchedule);

ComputationGraphConfiguration.GraphBuilder graphBuilder = new NeuralNetConfiguration.Builder()
        .seed(System.currentTimeMillis())
        .weightInit(WeightInit.RELU)
        .l2(1e-4)
        .updater(new Adam(schedule))
        .graphBuilder()
        .addInputs("input")
        .setInputTypes(InputType.convolutional(M + 2, N, 1));

String lastLayer = "input";
for (int i = 0; i < nndepth; i++) {
    graphBuilder.addLayer("torso_" + i + "_conv", new ConvolutionLayer.Builder()
            .kernelSize(3, 3)
            .stride(1, 1)
            .nIn(i == 0 ? 1 : numHiddenNodes)
            .nOut(numHiddenNodes)
            .padding((3 - 1) / 2, (3 - 1) / 2) // padding for a 3x3 kernel
            .activation(Activation.RELU)
            .build(), lastLayer);
    lastLayer = "torso_" + i + "_conv";
}

graphBuilder.addLayer("policy_conv",
        new ConvolutionLayer.Builder()
                .nIn(numHiddenNodes)
                .nOut(numHiddenNodes)
                .kernelSize(3, 3)
                .padding((3 - 1) / 2, (3 - 1) / 2) // padding for a 3x3 kernel
                .stride(1, 1)
                .activation(Activation.RELU)
                .build(),
        lastLayer);
graphBuilder.addLayer("policy_output",
        new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                .nIn(numHiddenNodes)
                .nOut(numOutputs)
                .activation(Activation.SOFTMAX)
                .build(),
        "policy_conv");
graphBuilder.addLayer("value_conv",
        new ConvolutionLayer.Builder()
                .nIn(numHiddenNodes)
                .nOut(numHiddenNodes)
                .kernelSize(3, 3)
                .padding((3 - 1) / 2, (3 - 1) / 2) // padding for a 3x3 kernel
                .stride(1, 1)
                .activation(Activation.RELU)
                .build(),
        lastLayer);
graphBuilder.addLayer("value_output",
        new OutputLayer.Builder(LossFunctions.LossFunction.MSE)
                .nIn(numHiddenNodes)
                .nOut(1)
                .activation(Activation.IDENTITY)
                .build(),
        "value_conv");
graphBuilder.setOutputs("policy_output", "value_output");

ComputationGraphConfiguration conf = graphBuilder.build();
model = new ComputationGraph(conf);
model.init(); // initialize parameters before fitting or inference
I have also experimented with different VM options; the last ones I tried were: -Dlog4j.debug=true -Dlog4j.configuration=file:"path/log4j.properties" -Xms16G -Xmx16g. I monitored CPU and memory usage, but the problem persists.
Any suggestions on how to improve performance or resolve the freezing issue would be greatly appreciated. Thank you!
UPDATE: The problem does not occur if I remove the padding or if I remove the multithreading; as long as at least one of the two is removed (or both, obviously), it runs fine. A sketch of the padding-free variant is below.
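For reference, the padding-free variant simply drops the .padding(...) call from each convolution layer (a minimal sketch of one torso layer; note that without padding every 3x3 convolution shrinks the spatial dimensions by 2):

// Variant without explicit padding; output shrinks from HxW to (H-2)x(W-2) per layer
graphBuilder.addLayer("torso_" + i + "_conv", new ConvolutionLayer.Builder()
        .kernelSize(3, 3)
        .stride(1, 1)
        .nIn(i == 0 ? 1 : numHiddenNodes)
        .nOut(numHiddenNodes)
        .activation(Activation.RELU)
        .build(), lastLayer);

(DL4J's ConvolutionMode.Same would be another way to get same-size outputs without an explicit .padding(...) call, though I haven't verified whether it avoids the issue.)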
Upvotes: 0
Views: 77
Reputation: 3205
Use ParallelInference. It will handle the multi-threaded inference for you.
Example usage:
ParallelInference inf = new ParallelInference.Builder(model)
        .inferenceMode(InferenceMode.SEQUENTIAL)
        .workers(2)
        .build();
Pass in your model, specify the inference mode you want, and set the number of workers. Note that this replicates the model to each worker thread.
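All of your threads can then share the single ParallelInference instance instead of cloning the model themselves (a minimal sketch, assuming a standard java.util.concurrent executor; the thread count and input are placeholders):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.nd4j.linalg.api.ndarray.INDArray;

ExecutorService pool = Executors.newFixedThreadPool(4);
for (int t = 0; t < 4; t++) {
    pool.submit(() -> {
        INDArray input = ...; // your input data
        // Thread-safe: the request is queued and executed on one of the workers
        INDArray output = inf.output(input);
    });
}
pool.shutdown();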
Upvotes: 1