Reputation: 415
I am trying to read a 10+ GB CSV file in a Dataflow job using Apache Beam's FileIO by calling ReadableFile.readFullyAsUTF8String, and it is failing with the error below.
It looks like reading a file whose size exceeds Integer.MAX_VALUE fails. Please advise.
```
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
at org.apache.beam.sdk.util.StreamUtils.getBytes(StreamUtils.java:64)
at org.apache.beam.sdk.io.FileIO$ReadableFile.readFullyAsBytes(FileIO.java:419)
at org.apache.beam.sdk.io.FileIO$ReadableFile.readFullyAsUTF8String(FileIO.java:424)
```
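For reference, this is roughly the pattern I am using (a minimal sketch; the bucket path and class name here are placeholders, not my real pipeline):

```java
import java.io.IOException;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class ReadWholeFile {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(FileIO.match().filepattern("gs://my-bucket/large-file.csv"))
     .apply(FileIO.readMatches())
     .apply(MapElements
         .into(TypeDescriptors.strings())
         // readFullyAsUTF8String buffers the whole file into a single byte[],
         // which cannot exceed Integer.MAX_VALUE bytes (roughly 2 GB).
         .via((FileIO.ReadableFile file) -> {
           try {
             return file.readFullyAsUTF8String();
           } catch (IOException e) {
             throw new RuntimeException(e);
           }
         }));

    p.run();
  }
}
```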
Upvotes: 0
Views: 1016
Reputation: 551
The Dataflow runner defaults to n1-standard-1 workers in most cases, I believe. Those don't have much memory. You can override this by passing the workerMachineType parameter to the runner to specify a machine type with more than 10 GB of memory.
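As a sketch (the machine type shown is only an example), you can set this on the Dataflow pipeline options programmatically:

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class HighMemJob {
  public static void main(String[] args) {
    // Parse the usual command-line options, then override the worker type.
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(DataflowPipelineOptions.class);
    // "n1-highmem-8" (52 GB RAM) is just an example; pick any type with
    // enough memory to hold the whole file.
    options.setWorkerMachineType("n1-highmem-8");

    Pipeline pipeline = Pipeline.create(options);
    // ... build the rest of the pipeline here ...
    pipeline.run();
  }
}
```

Equivalently, you can pass `--workerMachineType=n1-highmem-8` on the command line when launching the job.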
However, this approach does not take full advantage of Beam's parallelism. Reading the entire file into memory creates a bottleneck and a high memory load that you could avoid by splitting the read into multiple fragments. You might want to look into other ways of reading your CSV; for instance, TextIO might be useful if each line of the CSV is a separate record (see the sketch below). That won't work, however, if you need the entire file contents at once for some reason, e.g. the file is compressed.
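A minimal sketch of the line-based approach (the path and class name are placeholders), which lets Beam split the file across workers instead of loading it all on one:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class ReadCsvLines {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read the CSV line by line; Beam can split the file and parallelize this.
    PCollection<String> lines = pipeline.apply(
        TextIO.read().from("gs://my-bucket/large-file.csv"));

    // Each element of 'lines' is one CSV row; parse the fields downstream,
    // e.g. with MapElements or a ParDo.
    pipeline.run();
  }
}
```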
Upvotes: 2