ivarg

Reputation: 785

Why does Cloud Dataflow run its workers in a different region from where my data lies?

In an evaluation of GCP as a potential analytics platform for our business, I have set up a Cloud Storage bucket to be located in the EU. I have configured my BigQuery dataset to also be located in the EU. But when I run an ETL job in the Cloud Dataflow service that moves data from the former to the latter, I see the following message in the logs:

Worker configuration: n1-standard-1 in us-central1-f

Apart from the technical questions that arise regarding performance and latency, I am also concerned about the legal implications of having data that must stay within the EU round-tripping to US data centers for processing.

I cannot specify a worker location in the DataflowPipelineRunner options, and I cannot tell from the Data Processing and Security Terms whether I can assume that my data doesn't move.

Is it expected that Cloud Dataflow may process my data geographically anywhere it finds convenient, regardless of where it is stored or where it is destined?

Upvotes: 0

Views: 819

Answers (1)

jkff

Reputation: 17913

According to the documentation:

The Dataflow service deploys Compute Engine resources in the zone us-central1-f by default. You can override this setting by specifying the --zone option when you create your pipeline.

This option is declared in DataflowPipelineWorkerPoolOptions.
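As a rough illustration, setting the zone programmatically in a Dataflow Java SDK 1.x pipeline might look like the sketch below. The zone value `europe-west1-b` is just an example of an EU zone; `DataflowPipelineOptions` extends `DataflowPipelineWorkerPoolOptions`, which is where the `zone` option is declared.

```java
// Sketch: pinning Dataflow workers to an EU zone (Dataflow Java SDK 1.x).
// "europe-west1-b" is an example zone, not a recommendation.
DataflowPipelineOptions options = PipelineOptionsFactory
    .fromArgs(args)
    .withValidation()
    .as(DataflowPipelineOptions.class);

// Overrides the default us-central1-f worker zone.
// Equivalent to passing --zone=europe-west1-b on the command line.
options.setZone("europe-west1-b");

Pipeline p = Pipeline.create(options);
```

Alternatively, the same effect can be achieved without code changes by passing `--zone=europe-west1-b` among the pipeline's command-line arguments.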

Upvotes: 1

Related Questions