MT467

Reputation: 698

Load huge data from BigQuery to python/pandas/dask

I read other similar threads and searched Google to find a better way but couldn't find any workable solution.

I have a large table in BigQuery (assume around 20 million rows inserted per day). I want to have around 20 million rows of data, with around 50 columns, in python/pandas/dask to do some analysis. I have tried using the bqclient, pandas-gbq and BQ Storage API approaches, but it takes 30 minutes to get 5 million rows into Python. Is there any other way to do so? Is there even a Google service available to do a similar job?

Upvotes: 6

Views: 9580

Answers (5)

bnaul

Reputation: 17656

A couple of years late, but we're developing a new dask_bigquery library to make it easy to move back and forth between BQ and Dask dataframes. Check it out and let us know what you think!
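
A minimal sketch of how it can be used (read_gbq is the library's main entry point; the project/dataset/table names below are placeholders for your own):

    import dask_bigquery

    # Reads the BigQuery table into a lazy Dask DataFrame, one partition per read stream
    ddf = dask_bigquery.read_gbq(
        project_id="your-project",
        dataset_id="your_dataset",
        table_id="your_table",
    )

    ddf.head()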

Upvotes: 0

saifuddin778

Reputation: 7298

Instead of querying, you can always export stuff to cloud storage -> download locally -> load into your dask/pandas dataframe:

  1. Export + Download:

    bq --location=US extract --destination_format=CSV 'dataset.tablename' gs://mystoragebucket/data-*.csv && gsutil -m cp gs://mystoragebucket/data-*.csv /my/local/dir/
    
  2. Load into Dask:

    >>> import dask.dataframe as dd
    >>> df = dd.read_csv("/my/local/dir/*.csv")
    

Hope it helps.

Upvotes: 6

Zaki Indra Sukma

Reputation: 57

You probably want to export the data to Google Cloud Storage first, then download it to your local machine and load it. Here are the steps you need to take:

  • Create an intermediate table containing the data you want to export. You can run a SELECT and store the result in the intermediate table.
  • Export the intermediate table to Google Cloud Storage, in JSON/Avro/Parquet format.
  • Download your exported data and load it into your Python app (a sketch follows this list).
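
As a rough illustration of the export and load steps with the google-cloud-bigquery client (table, bucket and path names below are placeholders; reading Parquet straight from GCS assumes gcsfs is installed, otherwise gsutil cp the files locally first):

    from google.cloud import bigquery
    import dask.dataframe as dd

    client = bigquery.Client()

    # Export the intermediate table to GCS as Parquet files
    job_config = bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.PARQUET
    )
    client.extract_table(
        "my-project.my_dataset.intermediate_table",
        "gs://mystoragebucket/export/part-*.parquet",
        job_config=job_config,
    ).result()

    # Load the exported files into Dask (directly from GCS, or from a local copy)
    ddf = dd.read_parquet("gs://mystoragebucket/export/part-*.parquet")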

Besides downloading the data to your local machine, you can do the processing with PySpark and Spark SQL. After you export the data to Google Cloud Storage, you can spin up a Cloud Dataproc cluster, load the data from Google Cloud Storage into Spark, and do the analysis there.
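
For instance, a minimal PySpark sketch along those lines (assuming the table was exported to Parquet on GCS as above; the bucket path and column name are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bq-export-analysis").getOrCreate()

    # On Dataproc the GCS connector is preinstalled, so gs:// paths work directly
    df = spark.read.parquet("gs://mystoragebucket/export/")

    # Run the heavy aggregation on the cluster instead of on your local machine
    df.groupBy("some_column").count().show()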

You can read an example here:

https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example

and you can also spin up a Jupyter notebook in the Dataproc cluster:

https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook

Hope this helps.

Upvotes: 0

mdurant

Reputation: 28684

First, you should profile your code to find out what is taking the time. Is it just waiting for big-query to process your query? Is it the download of the data? What is your bandwidth, and what fraction of it are you using? Is it parsing of that data into memory?

Since you can make SQLAlchemy support big-query ( https://github.com/mxmzdlv/pybigquery ), you could try using dask.dataframe.read_sql_table to split your query into partitions and load/process them in parallel. If big-query is limiting the bandwidth on a single connection or to a single machine, you may get much better throughput by running this on a distributed cluster.
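
A rough sketch of that approach (assuming pybigquery is installed so SQLAlchemy understands bigquery:// URIs; the project/dataset/table and column names are placeholders, and the index column should be numeric or datetime so dask can split the query on it):

    import dask.dataframe as dd

    # pybigquery provides the bigquery:// SQLAlchemy dialect
    ddf = dd.read_sql_table(
        "my_table",
        "bigquery://my-project/my_dataset",
        index_col="id",    # column dask partitions the query on
        npartitions=50,    # ~50 roughly equal partitions, loaded in parallel
    )

    # Computations now run partition-by-partition, in parallel
    result = ddf.groupby("some_column").size().compute()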

Experiment!

Upvotes: 2

Lak

Reputation: 4166

Some options:

  • Try to do aggregations etc. in BigQuery SQL before exporting (a smaller table) to Pandas; see the sketch after this list.
  • Run your Jupyter notebook on Google Cloud, using a Deep Learning VM on a high-memory machine in the same region as your BigQuery dataset. That way, network overhead is minimized.
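
For the first option, a minimal sketch using the google-cloud-bigquery client (the query and all table/column names are placeholders; only the aggregated result is downloaded into pandas):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Do the heavy grouping/filtering inside BigQuery...
    sql = """
        SELECT user_id, COUNT(*) AS n_events, AVG(value) AS avg_value
        FROM `my-project.my_dataset.my_table`
        WHERE event_date >= '2019-01-01'
        GROUP BY user_id
    """

    # ...and only pull the (much smaller) result into pandas
    df = client.query(sql).to_dataframe()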

Upvotes: 1
