Reputation: 1459
I'm intermittently getting out-of-memory errors in a Dataflow job when inserting data into BigQuery using the Apache Beam SDK for Java 2.29.0.
Here is the stack trace:
Error message from worker: java.lang.RuntimeException: java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:982)
org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.insertAll(BigQueryServicesImpl.java:1022)
org.apache.beam.sdk.io.gcp.bigquery.BatchedStreamingWrite.flushRows(BatchedStreamingWrite.java:375)
org.apache.beam.sdk.io.gcp.bigquery.BatchedStreamingWrite.access$800(BatchedStreamingWrite.java:69)
org.apache.beam.sdk.io.gcp.bigquery.BatchedStreamingWrite$BatchAndInsertElements.finishBundle(BatchedStreamingWrite.java:271)
Caused by: java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
java.base/java.lang.Thread.start0(Native Method)
java.base/java.lang.Thread.start(Thread.java:803)
java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937)
java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343)
java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:129)
java.base/java.util.concurrent.Executors$DelegatedExecutorService.submit(Executors.java:724)
com.google.api.client.http.javanet.NetHttpRequest.writeContentToOutputStream(NetHttpRequest.java:188)
com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:117)
com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:84)
com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1012)
com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:514)
com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:455)
com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:565)
org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$DatasetServiceImpl.lambda$insertAll$1(BigQueryServicesImpl.java:906)
org.apache.beam.sdk.io.gcp.bigquery.BigQueryServicesImpl$BoundedExecutorService$SemaphoreCallable.call(BigQueryServicesImpl.java:1492)
java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
java.base/java.lang.Thread.run(Thread.java:834)
I tried increasing the worker node size, but I'm still seeing the same issue.
Upvotes: 0
Views: 752
Reputation: 6582
I really recommend upgrading your Beam version to 2.42.0 (the latest). Also check whether you have aggregations such as GroupBy or GroupByKey that are memory-costly inside a worker.
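For instance, when you only need a per-key reduction, a Combine-based transform is usually much lighter on worker memory than a raw GroupByKey, because values are pre-combined on each worker instead of being fully buffered. A minimal sketch (a hypothetical pipeline fragment, not code from the question):

import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class LighterAggregation {
  // Hypothetical helper: sums values per key with a Combine-based transform
  // (Sum.longsPerKey()), which pre-aggregates on each worker instead of
  // buffering every value for a key the way GroupByKey does.
  static PCollection<KV<String, Long>> sumPerKey(PCollection<KV<String, Long>> counts) {
    return counts.apply(Sum.longsPerKey());
  }
}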
You can also use Dataflow Prime, the latest execution engine for Dataflow, which helps prevent errors like OutOfMemory in a worker through vertical autoscaling.
Dataflow Prime can be enabled with a program argument; for example, for Beam Java:
--dataflowServiceOptions=enable_prime
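A minimal sketch of a Beam Java entry point that picks this flag up from the command line (standard PipelineOptionsFactory usage; the class name and pipeline contents are placeholders):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class EnablePrimeExample {
  public static void main(String[] args) {
    // Launch with, e.g.:
    //   --runner=DataflowRunner --project=... --region=... \
    //   --dataflowServiceOptions=enable_prime
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);
    // ... build your pipeline here ...
    p.run();
  }
}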
Dataflow Prime helps in this case, but you should still check and optimize your job where needed and avoid costly operations where possible (memory leaks, unnecessary aggregations, costly serialization, ...).
Upvotes: 1
Reputation: 1356
OutOfMemory issues can be very tough to debug because the symptom you see may be totally unrelated to the source of the memory pressure. Your pipeline is throwing this error while trying to create a thread in the insertAll method, but it's possible that most of your memory usage is coming from some other part of your pipeline.
There's some in-depth advice on debugging memory issues at https://cloud.google.com/community/tutorials/dataflow-debug-oom-conditions.
If the memory pressure is coming from BigQueryIO, take a look at various config options such as maxStreamingRowsToBatch.
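For example, a smaller streaming batch size means less data buffered per insertAll call. A minimal sketch, assuming the maxStreamingRowsToBatch option exposed by BigQueryOptions in your SDK version (check the exact name and default in your Beam release before relying on it):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class SmallerStreamingBatches {
  public static void main(String[] args) {
    BigQueryOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(BigQueryOptions.class);
    // Assumed option; 200 is only an illustrative value, tune it for your
    // row size and throughput.
    options.setMaxStreamingRowsToBatch(200L);
    Pipeline p = Pipeline.create(options);
    // ... build the pipeline with BigQueryIO.write() ...
    p.run();
  }
}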
Upvotes: 0