vkt

Reputation: 1459

Apache Beam Dataflow BigQuery streaming insertions out-of-memory error

I'm intermittently getting out-of-memory errors on my Dataflow job when inserting data into BigQuery using the Apache Beam SDK for Java 2.29.0.
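
For context, the write step is a plain streaming insert along the lines of the sketch below (the table name, schema, and variable names are placeholders rather than my exact code):

    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;

    // rows is a PCollection<TableRow> produced earlier in the pipeline;
    // tableSchema is the com.google.api.services.bigquery.model.TableSchema of the target table.
    rows.apply(
        "WriteToBigQuery",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")
            .withSchema(tableSchema)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS));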

Here is the stack trace:

    Error message from worker: java.lang.RuntimeException: java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
        org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:982)
        org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:1022)
        org.apache.beam.sdk.io.gcp.bigquery.BatchedStreamingWrite.flushRows(BatchedStreamingWrite.java:375)
        org.apache.beam.sdk.io.gcp.bigquery.BatchedStreamingWrite.access$800(BatchedStreamingWrite.java:69)
        org.apache.beam.sdk.io.gcp.bigquery.BatchedStreamingWrite$BatchAndInsertElements.finishBundle(BatchedStreamingWrite.java:271)
Caused by: java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
        java.base/java.lang.Thread.start0(Native Method)
        java.base/java.lang.Thread.start(Thread.java:803)
        java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937)
        java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343)
        java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:129)
        java.base/java.util.concurrent.Executors$DelegatedExecutorService.submit(Executors.java:724)
        com.google.api.client.http.javanet.NetHttpRequest.writeContentToOutputStream(NetHttpRequest.java:188)
        com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:117)
        com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:84)
        com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1012)
        com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:514)
        com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:455)
        com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:565)
        org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.lambda$insertAll$1(BigQueryServicesImpl.java:906)
        org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$BoundedExecutorService$SemaphoreCallable.call(BigQueryServicesImpl.java:1492)
        java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        java.base/java.lang.Thread.run(Thread.java:834)

I tried increasing the worker machine size, but I'm still seeing the same issue.

Upvotes: 0

Views: 752

Answers (2)

Mazlum Tosun

Reputation: 6582

I really recommend upgrading your Beam version to 2.42.0 (the latest).

Also check whether you have aggregations like GroupBy or GroupByKey; they can be costly in memory inside a worker.
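
If you do find a GroupByKey whose grouped values you then fold by hand, switching to a Combine is usually much lighter on worker memory. A minimal sketch (the perKeyValues collection and Long values are only an illustration, not taken from the question):

    import org.apache.beam.sdk.transforms.Combine;
    import org.apache.beam.sdk.transforms.Sum;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    // Combine.perKey aggregates incrementally on the workers instead of
    // materializing the full Iterable of values that GroupByKey produces.
    PCollection<KV<String, Long>> totals =
        perKeyValues.apply("SumPerKey", Combine.perKey(Sum.ofLongs()));

    // The memory-heavier pattern it replaces:
    // perKeyValues.apply(GroupByKey.create())
    //             .apply(ParDo.of(/* sum the Iterable<Long> by hand */));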

You can also use Dataflow Prime, the latest execution engine for Dataflow, which helps prevent errors like OutOfMemory on a worker through vertical autoscaling; see the Dataflow Prime documentation.

Dataflow Prime can be enabled with a pipeline argument; for example, with Beam Java:

--dataflowServiceOptions=enable_prime
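
If you prefer to set it from code instead of a command-line flag, something like this should work (a sketch; it assumes setDataflowServiceOptions is available on DataflowPipelineOptions in your SDK version):

    import java.util.Collections;
    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    // Assumption: setDataflowServiceOptions(...) exists on DataflowPipelineOptions in
    // your SDK version; otherwise pass --dataflowServiceOptions=enable_prime on the command line.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setDataflowServiceOptions(Collections.singletonList("enable_prime"));
    Pipeline pipeline = Pipeline.create(options);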

Dataflow Prime helps in this case, but you should still review and optimize your job where needed and avoid costly operations where possible (memory leaks, unnecessary aggregations, expensive serialization, ...).

Upvotes: 1

Jeff Klukas

Reputation: 1356

OutOfMemory issues can be very tough to debug because the symptom you see may be totally unrelated to the sources of memory pressure. So your pipeline is throwing this when trying to create a thread in the insertAll method, but it's possible that most of your memory usage is coming from some other part of your pipeline.

There's some in-depth advice on debugging memory issues at https://cloud.google.com/community/tutorials/dataflow-debug-oom-conditions

If the memory pressure is coming from BigQueryIO, take a look at various config options such as maxStreamingRowsToBatch.
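
For example, lowering the per-request batch size spreads the same data over more, smaller insertAll calls. A sketch (the option lives on BigQueryOptions; exact names and defaults depend on your SDK version, and 200 is only an illustrative value):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    // Sketch: allow fewer rows per streaming-insert request than the default;
    // adjust or drop this if the option is not present in your Beam version.
    BigQueryOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(BigQueryOptions.class);
    options.setMaxStreamingRowsToBatch(200L);
    Pipeline pipeline = Pipeline.create(options);

Since it's a regular pipeline option, it should also be settable on the command line, e.g. --maxStreamingRowsToBatch=200.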

Upvotes: 0
