MT467

Reputation: 698

Processing huge dataset from bigquery using python, load it back to a bigquery table

I have a huge dataset in BigQuery with 50 million rows and 57 columns. I want to do a lot of filtering/transformation/cleaning without using SQL. I tried using Dask/pandas/Python to load the data into a Dask dataframe on my local Mac, do the transformations, and then push the data back to BigQuery so other BUs can use it. Pushing the data back to BigQuery takes more than 3 hours. Is there another approach, or maybe a Google Cloud service, that I can leverage?
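Roughly, my current flow looks like the sketch below (simplified; the project, dataset, table, and column names are placeholders, and the filter/cleanup lines just stand in for the real transformations). It needs google-cloud-bigquery, dask, and pandas-gbq installed:

```python
import dask.dataframe as dd
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project ID

# Pull the whole table down locally (50M rows x 57 columns).
df = client.query(
    "SELECT * FROM `my-project.my_dataset.raw_table`"
).to_dataframe()

# Wrap it in Dask to do the filtering/cleaning in parallel on my Mac.
ddf = dd.from_pandas(df, npartitions=16)
ddf = ddf[ddf["status"] == "active"]                # example filter
ddf["name"] = ddf["name"].str.strip().str.lower()   # example cleanup
result = ddf.compute()

# Push the cleaned frame back to BigQuery -- this is the 3+ hour step.
result.to_gbq(
    "my_dataset.cleaned_table",
    project_id="my-project",
    if_exists="replace",
)
```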

Upvotes: 0

Views: 834

Answers (1)

Kolban

Reputation: 15246

If you have a large amount of data in BigQuery and wish to perform transformations on it, one possible solution is the GCP service called Dataflow. Dataflow is Google's managed service based on Apache Beam. Using this technology, you can write a pipeline with BigQuery as both a source and a sink. Dataflow is specifically designed for extremely high-volume data processing and can parallelize the work automatically. In addition, since it all runs within GCP, there is none of the latency in reading or writing the data that you would see when transferring it over the Internet. Dataflow allows a programmer to write transformations in either Java or Python.
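A minimal Python sketch of such a pipeline might look like the following. The project, region, bucket, dataset, table, and column names are all placeholders, the transform is just an illustration, and the destination table is assumed to already exist with a compatible schema (install with `pip install apache-beam[gcp]`):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def clean_row(row):
    """Example transformation: rows arrive as dicts keyed by column name."""
    row["name"] = (row.get("name") or "").strip().lower()  # placeholder column
    return row


options = PipelineOptions(
    runner="DataflowRunner",             # use "DirectRunner" to test locally on a sample
    project="my-project",                # placeholder project ID
    region="us-central1",
    temp_location="gs://my-bucket/tmp",  # staging bucket for BigQuery exports/loads
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromBigQuery" >> beam.io.ReadFromBigQuery(
            query="SELECT * FROM `my-project.my_dataset.source_table`",
            use_standard_sql=True,
        )
        | "CleanRows" >> beam.Map(clean_row)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.cleaned_table",
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,  # table assumed to exist
        )
    )
```

Submitting the job from your laptop only uploads the pipeline definition; the reads, transforms, and writes all execute inside GCP, which is what removes the multi-hour upload step.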

Depending on your transformations, a higher-level option (but a similar story) might be Google's Dataprep service. Dataprep provides a high-level (business-level) mechanism to transform data without any programming required. Using Dataprep, you describe the transform at a much higher level, and it automatically builds and runs a Dataflow job on your behalf.

Upvotes: 1
