Reputation: 305
I have a pipeline that reads data from an MS SQL Server and stores it in a file in a BLOB container in Azure Storage. The file is in Parquet (Apache Parquet) format.
When the “sink” (output) file is stored compressed (snappy or gzip, it does not matter) AND the file is large enough (more than 50 MB), the pipeline fails with the following message:
"errorCode": "2200",
"message": "Failure happened on 'Sink' side.
ErrorCode=UserErrorJavaInvocationException,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=An error occurred when invoking java, message: java.lang.OutOfMemoryError:Java heap space\ntotal entry:11\r\njava.util.ArrayDeque.doubleCapacity(Unknown Source)\r\njava.util.ArrayDeque.addFirst(Unknown Source)\r\njava.util.ArrayDeque.push(Unknown Source)\r\norg.apache.parquet.io.ValidatingRecordConsumer.endField(ValidatingRecordConsumer.java:108)\r\norg.apache.parquet.example.data.GroupWriter.writeGroup(GroupWriter.java:58)\r\norg.apache.parquet.example.data.GroupWriter.write(GroupWriter.java:37)\r\norg.apache.parquet.hadoop.example.GroupWriteSupport.write(GroupWriteSupport.java:87)\r\norg.apache.parquet.hadoop.example.GroupWriteSupport.write(GroupWriteSupport.java:37)\r\norg.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:123)\r\norg.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:292)\r\ncom.microsoft.datatransfer.bridge.parquet.ParquetBatchWriter.addRows(ParquetBatchWriter.java:60)\r\n,Source=Microsoft.DataTransfer.Common,''Type=Microsoft.DataTransfer.Richfile.JniExt.JavaBridgeException,Message=,Source=Microsoft.DataTransfer.Richfile.HiveOrcBridge,'",
"failureType": "UserError",
"target": "Work_Work"
}
The "Work_Work"
is the name of a Copy Data activity in the pipeline.
If I turn the compression off (the generated BLOB file is uncompressed), the error does not happen.
Is this the error described in the following passage from the documentation:
“…If you copy data to/from Parquet format using Self-hosted Integration Runtime and hit error saying "An error occurred when invoking java, message: java.lang.OutOfMemoryError:Java heap space", you can add an environment variable _JAVA_OPTIONS in the machine that hosts the Self-hosted IR to adjust the min/max heap size for JVM to empower such copy, then rerun the pipeline….”?
If it is, have I understood correctly that I have to go to the server that hosts the “Self-hosted Integration Runtime” (I still have no idea what that is) and increase the max heap size for the JVM? Is this correct?
If it is, my next question is: how large should the max heap size be? My pipeline can generate a file whose size will be 30 GB.
What “max heap size” can guarantee that such a file will not cause the failure?
Upvotes: 3
Views: 3450
Reputation: 12788
If you copy data to/from Parquet format using Self-hosted Integration Runtime and hit error saying "An error occurred when invoking java, message: java.lang.OutOfMemoryError:Java heap space", you can add an environment variable _JAVA_OPTIONS in the machine that hosts the Self-hosted IR to adjust the min/max heap size for JVM to empower such copy, then rerun the pipeline.
Example: set the variable _JAVA_OPTIONS with the value -Xms256m -Xmx16g. The flag Xms specifies the initial memory allocation pool for a Java Virtual Machine (JVM), while Xmx specifies the maximum memory allocation pool. This means that the JVM will be started with Xms amount of memory and will be able to use a maximum of Xmx amount of memory. By default, ADF uses a minimum of 64 MB and a maximum of 1 GB.
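As a quick sanity check (just a verification sketch, not part of ADF itself), you can run a tiny Java program on the machine that hosts the Self-hosted IR: when _JAVA_OPTIONS is set, the JVM prints "Picked up _JAVA_OPTIONS: ..." on startup, and the program below reports the heap limits it actually received.

// HeapCheck.java - prints the heap limits the current JVM received.
// Compile and run on the Self-hosted IR host: javac HeapCheck.java && java HeapCheck
public class HeapCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // totalMemory(): memory currently allocated to the heap (starts near -Xms)
        System.out.println("Current heap: " + rt.totalMemory() / (1024 * 1024) + " MB");
        // maxMemory(): the -Xmx ceiling this JVM will never exceed
        System.out.println("Max heap:     " + rt.maxMemory() / (1024 * 1024) + " MB");
    }
}

After changing the variable, restart the Integration Runtime service on that machine so the copy activity's JVM is relaunched with the new environment. Also note that the Parquet writer generally buffers data one row group at a time, so the heap requirement grows with row-group size and column count rather than with the total output file size; you should not need a heap as large as the 30 GB file itself.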
Reference: https://learn.microsoft.com/en-us/azure/data-factory/format-parquet
Hope this helps.
Upvotes: 2