jamiet
jamiet

Reputation: 12224

How can I use dataproc to pull data from bigquery that is not in the same project as my dataproc cluster?

I work for an organisation that needs to pull data from one of our client's bigquery datasets using Spark and given that both the client and ourselves use GCP it makes sense to use Dataproc to achieve this.

I have read Use the BigQuery connector with Spark which looks very useful however it seems to make the assumption that the dataproc cluster, the bigquery dataset and the storage bucket for temporary BigQuery export are all in the same GCP project - that is not the case for me.

I have a service account key file that allows me to connect to and interact with our client's data stored in bigquery, how can I use that service account key file in conjunction with the BigQuery connector and dataproc in order to pull data from bigquery and interact with it in dataproc? To put it another way, how can I modify the code provided at Use the BigQuery connector with Spark to use my service account key file?

Upvotes: 0

Views: 446

Answers (1)

Igor Dvorzhak
Igor Dvorzhak

Reputation: 4457

To use service account key file authorization you need to set mapred.bq.auth.service.account.enable property to true and point BigQuery connector to a service account json keyfile using mapred.bq.auth.service.account.json.keyfile property (cluster or job). Note that this property value is a local path, that's why you need to distribute a keyfile to all the cluster nodes beforehand, using initialization action, for example.

Alternatively, you can use any authorization method described here, but you need to replace fs.gs properties prefix with mapred.bq for BigQuery connector.

Upvotes: 3

Related Questions