Andres Urrego Angel

Reputation: 1932

GCP Dataproc spark consuming BigQuery

I'm very new to GCP (Google Cloud Platform), so I hope my question doesn't look too silly.

Background:

The main goal is to gather a few large tables from BigQuery and apply a few transformations. Because of the size of the tables I'm planning to use Dataproc, deploying a PySpark script; ideally I would be able to use sqlContext to apply a few SQL queries to the DFs (the tables pulled from BQ). Finally, I could easily dump this info into a file within a Cloud Storage bucket.
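For reference, the overall shape of such a PySpark job might look roughly like the sketch below. The bucket path and column names are placeholders, and a tiny in-memory DataFrame stands in for the table pulled from BigQuery, since the BigQuery read itself is what the question (and the answers below) are about.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-transform-to-gcs").getOrCreate()

# Stand-in for the table pulled from BigQuery (see the answers below for the
# actual read); a tiny in-memory DataFrame keeps the sketch self-contained.
df = spark.createDataFrame([("a", 1), ("b", 2)], ["col_a", "col_b"])

# Register the DataFrame as a temp view so plain SQL can be applied to it.
df.createOrReplaceTempView("my_table")
result = spark.sql("SELECT col_a, SUM(col_b) AS total FROM my_table GROUP BY col_a")

# Dump the result into a Cloud Storage bucket (placeholder path).
result.write.csv("gs://my-output-bucket/output/", header=True)
```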

Questions:

my script

Update: after a back and forth with @Tanvee, who kindly attended to this question, we concluded that GCP requires an intermediate staging step when you need to read data from BigQuery into Dataproc. Briefly, your Spark or Hadoop script needs a temporary Cloud Storage bucket where the table's data is staged before it is brought into Spark.
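A minimal sketch of that staging pattern with the Hadoop BigQuery connector (following the pattern in the GCP documentation; the project, dataset, table, and bucket names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-staged-read").getOrCreate()
sc = spark.sparkContext

# Placeholder names: project, dataset, table, and the intermediate bucket.
conf = {
    "mapred.bq.project.id": "my-project",
    "mapred.bq.gcs.bucket": "my-temp-bucket",
    "mapred.bq.temp.gcs.path": "gs://my-temp-bucket/bq_staging",
    "mapred.bq.input.project.id": "my-project",
    "mapred.bq.input.dataset.id": "my_dataset",
    "mapred.bq.input.table.id": "my_table",
}

# The connector first exports the table into the staging bucket, then Spark
# reads those files as (row id, JSON string) pairs.
table_rdd = sc.newAPIHadoopRDD(
    "com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "com.google.gson.JsonObject",
    conf=conf)

# Keep only the JSON payload and turn it into a DataFrame for SQL work.
df = spark.read.json(table_rdd.map(lambda kv: kv[1]))
df.createOrReplaceTempView("my_table")
```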

References:

BigQuery Connector / Deployment

Thanks so much.

Upvotes: 3

Views: 2792

Answers (2)

Sarang Shinde

Reputation: 737

You can directly use the following options to connect to a BigQuery table from Spark.

  1. You can use the spark-bigquery connector https://github.com/samelamin/spark-bigquery to run your queries directly on Dataproc using Spark.

  2. https://github.com/GoogleCloudPlatform/spark-bigquery-connector This is a new connector which is in beta. It is a Spark data source API for BigQuery and is easy to use (see the sketch below).
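A minimal sketch of option 2, the data source API (the table reference is a placeholder, and it assumes the connector jar has been made available to the cluster):

```python
from pyspark.sql import SparkSession

# Assumes the connector jar was supplied to the cluster, for example via
# --jars gs://spark-lib/bigquery/spark-bigquery-latest.jar (path may vary).
spark = SparkSession.builder.appName("bq-datasource-read").getOrCreate()

# Placeholder table reference in project.dataset.table form.
df = spark.read.format("bigquery") \
    .option("table", "my-project.my_dataset.my_table") \
    .load()

df.createOrReplaceTempView("my_table")
spark.sql("SELECT COUNT(*) AS row_count FROM my_table").show()
```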

Please refer to the following link: Dataproc + BigQuery examples - any available?

Upvotes: 0

Tanveer Uddin

Reputation: 1525

You will need to use the BigQuery connector for Spark. There are some examples in the GCP documentation here and here. It will create an RDD, which you can convert to a DataFrame, and then you will be able to perform all the typical transformations. Hope that helps.
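A rough sketch of that RDD-to-DataFrame step (a tiny parallelized sample stands in for the connector output, which arrives as (row id, JSON string) pairs; the column names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-to-df").getOrCreate()
sc = spark.sparkContext

# Stand-in for the connector output: the BigQuery Hadoop connector yields
# (row id, JSON string) pairs, so a small parallelized sample mimics that.
table_rdd = sc.parallelize([(0, '{"name": "a", "value": 1}'),
                            (1, '{"name": "b", "value": 2}')])

# Convert the RDD to a DataFrame by keeping only the JSON payload.
df = spark.read.json(table_rdd.map(lambda kv: kv[1]))

# Typical DataFrame transformations then apply as usual.
df.filter(df.value > 0).groupBy("name").count().show()
```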

Upvotes: 2
