Reputation: 35
I'm encountering performance issues when running Deeplearning4j in a multi-threaded environment. The system slows down or gets stuck during inference, despite my creating a separate model instance for each thread. Training is unaffected: I call fit() on the main model and output() on the copied models.
Expected Behavior: I expect the model to perform inference efficiently in a multi-threaded context.
Actual Behavior: After several inference operations, the system dramatically slows down or gets stuck.
Setup: I'm using Deeplearning4j version M2 on a Mac with an i5 3.3 GHz processor and 16 GB of RAM. The ND4J version is M2.1.
Current Approach:
I create a new ComputationGraph instance for each thread before performing inference. Here's a snippet of my code for cloning the model:
// Existing model
ComputationGraph model = ...;
// Cloning for each thread
ComputationGraph clone = model.clone();
// Example inference code in each thread
INDArray input = ...; // your input data
INDArray[] output = clone.output(input); // ComputationGraph.output(...) returns one INDArray per configured output
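For context, each worker thread is launched roughly like this (a minimal sketch of the approach described above; the executor, thread count, and random dummy input are illustrative, not my exact code):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

ExecutorService pool = Executors.newFixedThreadPool(4);
for (int t = 0; t < 4; t++) {
    // One independent copy of the graph per thread
    ComputationGraph clone = model.clone();
    pool.submit(() -> {
        // Dummy input matching InputType.convolutional(M+2, N, 1): [batch, channels, height, width]
        INDArray input = Nd4j.rand(new int[]{1, 1, M + 2, N});
        INDArray[] output = clone.output(input);
    });
}
pool.shutdown();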
Current model:
Map<Integer, Double> learningRateSchedule = new HashMap<>();
learningRateSchedule.put(0, 2e-5);
learningRateSchedule.put(833, 2e-6);
learningRateSchedule.put(1666, 2e-7);
ISchedule schedule = new MapSchedule(ScheduleType.ITERATION, learningRateSchedule);

ComputationGraphConfiguration.GraphBuilder graphBuilder = new NeuralNetConfiguration.Builder()
        .seed(System.currentTimeMillis())
        .weightInit(WeightInit.RELU)
        .l2(1e-4)
        .updater(new Adam(schedule))
        .graphBuilder()
        .addInputs("input")
        .setInputTypes(InputType.convolutional(M + 2, N, 1));

String lastLayer = "input";
for (int i = 0; i < nndepth; i++) {
    graphBuilder.addLayer("torso_" + i + "_conv", new ConvolutionLayer.Builder()
            .kernelSize(3, 3)
            .stride(1, 1)
            .nIn(i == 0 ? 1 : numHiddenNodes)
            .nOut(numHiddenNodes)
            .padding((3 - 1) / 2, (3 - 1) / 2) // padding for a 3x3 kernel
            .activation(Activation.RELU)
            .build(), lastLayer);
    lastLayer = "torso_" + i + "_conv";
}

graphBuilder.addLayer("policy_conv",
        new ConvolutionLayer.Builder()
                .nIn(numHiddenNodes)
                .nOut(numHiddenNodes)
                .kernelSize(3, 3)
                .padding((3 - 1) / 2, (3 - 1) / 2) // padding for a 3x3 kernel
                .stride(1, 1)
                .activation(Activation.RELU)
                .build(),
        lastLayer);
graphBuilder.addLayer("policy_output",
        new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
                .nIn(numHiddenNodes)
                .nOut(numOutputs)
                .activation(Activation.SOFTMAX)
                .build(),
        "policy_conv");
graphBuilder.addLayer("value_conv",
        new ConvolutionLayer.Builder()
                .nIn(numHiddenNodes)
                .nOut(numHiddenNodes)
                .kernelSize(3, 3)
                .padding((3 - 1) / 2, (3 - 1) / 2) // padding for a 3x3 kernel
                .stride(1, 1)
                .activation(Activation.RELU)
                .build(),
        lastLayer);
graphBuilder.addLayer("value_output",
        new OutputLayer.Builder(LossFunctions.LossFunction.MSE)
                .nIn(numHiddenNodes)
                .nOut(1)
                .activation(Activation.IDENTITY)
                .build(),
        "value_conv");
graphBuilder.setOutputs("policy_output", "value_output");

ComputationGraphConfiguration conf = graphBuilder.build();
model = new ComputationGraph(conf);
model.init(); // initialize parameters before fitting or inference
I have also experimented with different VM options; the last ones I tried were: -Dlog4j.debug=true -Dlog4j.configuration=file:"path/log4j.properties" -Xms16G -Xmx16g. I monitored CPU and memory usage, but the problem persists.
Any suggestions on how to improve performance or resolve the freezing issue would be greatly appreciated. Thank you!
UPDATE: The problem does not occur if I remove the padding or if I remove the multithreading; as long as at least one of the two is removed (or both, obviously), it runs fine. A sketch of the padding-free variant is below.
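For reference, the padding-free variant simply drops the .padding(...) call from each convolution layer (a minimal sketch of one torso layer; note that without padding every 3x3 convolution shrinks the spatial dimensions by 2):

// Variant without explicit padding; output shrinks from HxW to (H-2)x(W-2) per layer
graphBuilder.addLayer("torso_" + i + "_conv", new ConvolutionLayer.Builder()
        .kernelSize(3, 3)
        .stride(1, 1)
        .nIn(i == 0 ? 1 : numHiddenNodes)
        .nOut(numHiddenNodes)
        .activation(Activation.RELU)
        .build(), lastLayer);

(DL4J's ConvolutionMode.Same would be another way to get same-size outputs without an explicit .padding(...) call, though I haven't verified whether it avoids the issue.)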
Upvotes: 0
Views: 77
Reputation: 3205
Use ParallelInference. It will handle the multi-threaded inference for you.
Example usage:
ParallelInference inf = new ParallelInference.Builder(model)
        .inferenceMode(InferenceMode.SEQUENTIAL)
        .workers(2)
        .build();
Pass in your model, specify the inference mode you want, and set the number of workers. Note that this replicates the model to each worker thread.
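All of your threads can then share the single ParallelInference instance instead of cloning the model themselves (a minimal sketch, assuming a standard java.util.concurrent executor; the thread count and input are placeholders):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.nd4j.linalg.api.ndarray.INDArray;

ExecutorService pool = Executors.newFixedThreadPool(4);
for (int t = 0; t < 4; t++) {
    pool.submit(() -> {
        INDArray input = ...; // your input data
        // Thread-safe: the request is queued and executed on one of the workers
        INDArray output = inf.output(input);
    });
}
pool.shutdown();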
Upvotes: 1