Victor Sotnikov

Reputation: 305

Azure Data Factory pipeline into compressed Parquet file: “java.lang.OutOfMemoryError:Java heap space”

I have a pipeline that reads data from an MS SQL Server and stores it in a file in a BLOB container in Azure Storage. The file is in Parquet (Apache Parquet) format.

So, when the “sink” (output) file is stored compressed (snappy or gzip, it does not matter) AND the file is large enough (more than 50 MB), the pipeline fails. The message is the following:

"errorCode": "2200",
    "message": "Failure happened on 'Sink' side. 
ErrorCode=UserErrorJavaInvocationException,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=An error occurred when invoking java, message: java.lang.OutOfMemoryError:Java heap space\ntotal entry:11\r\njava.util.ArrayDeque.doubleCapacity(Unknown Source)\r\njava.util.ArrayDeque.addFirst(Unknown Source)\r\njava.util.ArrayDeque.push(Unknown Source)\r\norg.apache.parquet.io.ValidatingRecordConsumer.endField(ValidatingRecordConsumer.java:108)\r\norg.apache.parquet.example.data.GroupWriter.writeGroup(GroupWriter.java:58)\r\norg.apache.parquet.example.data.GroupWriter.write(GroupWriter.java:37)\r\norg.apache.parquet.hadoop.example.GroupWriteSupport.write(GroupWriteSupport.java:87)\r\norg.apache.parquet.hadoop.example.GroupWriteSupport.write(GroupWriteSupport.java:37)\r\norg.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:123)\r\norg.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:292)\r\ncom.microsoft.datatransfer.bridge.parquet.ParquetBatchWriter.addRows(ParquetBatchWriter.java:60)\r\n,Source=Microsoft.DataTransfer.Common,''Type=Microsoft.DataTransfer.Richfile.JniExt.JavaBridgeException,Message=,Source=Microsoft.DataTransfer.Richfile.HiveOrcBridge,'",
    "failureType": "UserError",
    "target": "Work_Work"
}

The "Work_Work" is the name of a Copy Data activity in the pipeline. If I turn the compression off (the generated BLOB file is uncompressed), the error does not happen.

Is this the error described in the documentation:

“…If you copy data to/from Parquet format using Self-hosted Integration Runtime and hit error saying "An error occurred when invoking java, message: java.lang.OutOfMemoryError:Java heap space", you can add an environment variable _JAVA_OPTIONS in the machine that hosts the Self-hosted IR to adjust the min/max heap size for JVM to empower such copy, then rerun the pipeline….”?

If it is, have I understood correctly that I have to do the following: go to the server that hosts the “Self-hosted Integration Runtime” (I still have no idea what that is) and increase the max heap size for the JVM. Is this correct?

If it is, my next question is: how large should the max heap size be? My pipeline can generate a file whose size will be 30 GB.
What “max heap size” can guarantee that such a file will not cause the failure?

Upvotes: 3

Views: 3450

Answers (1)

CHEEKATLAPRADEEP

Reputation: 12788

If you copy data to/from Parquet format using Self-hosted Integration Runtime and hit error saying "An error occurred when invoking java, message: java.lang.OutOfMemoryError:Java heap space", you can add an environment variable _JAVA_OPTIONS in the machine that hosts the Self-hosted IR to adjust the min/max heap size for JVM to empower such copy, then rerun the pipeline.

Example: set the variable _JAVA_OPTIONS with the value -Xms256m -Xmx16g. The flag Xms specifies the initial memory allocation pool for the Java Virtual Machine (JVM), while Xmx specifies the maximum memory allocation pool. This means that the JVM will be started with Xms amount of memory and will be able to use a maximum of Xmx amount of memory. By default, ADF uses a minimum of 64 MB and a maximum of 1 GB.
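
As a quick sanity check (this is just a sketch of mine, not anything ADF provides), you can verify that the new setting was actually picked up on the machine that hosts the Self-hosted IR by running a small Java program there: any JVM started while _JAVA_OPTIONS is set prints "Picked up _JAVA_OPTIONS: ..." on startup, and Runtime.getRuntime().maxMemory() should roughly reflect the -Xmx value.

    // HeapCheck.java - a hypothetical helper, not part of ADF or the Self-hosted IR.
    // Compile with "javac HeapCheck.java" and run with "java HeapCheck" on the
    // machine that hosts the Self-hosted IR after setting _JAVA_OPTIONS.
    public class HeapCheck {
        public static void main(String[] args) {
            // maxMemory() returns the maximum number of bytes this JVM will
            // attempt to use, which roughly corresponds to the -Xmx setting.
            long maxBytes = Runtime.getRuntime().maxMemory();
            System.out.printf("Max heap available to this JVM: %.2f GB%n",
                    maxBytes / (1024.0 * 1024.0 * 1024.0));
        }
    }

Also keep in mind that whatever -Xmx you pick has to fit within the physical memory of the machine that hosts the Self-hosted IR.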

Reference: https://learn.microsoft.com/en-us/azure/data-factory/format-parquet

Hope this helps.

Upvotes: 2
