Reputation: 3514
I'm trying to upload a pandas.DataFrame
to Google Big Query using the pandas.DataFrame.to_gbq()
function documented here. The problem is that to_gbq()
takes 2.3 minutes while uploading directly to Google Cloud Storage takes less than a minute. I'm planning to upload a bunch of dataframes (~32) each one with a similar size, so I want to know what is the faster alternative.
This is the script that I'm using:
dataframe.to_gbq('my_dataset.my_table',
                 'my_project_id',
                 chunksize=None,  # I have tried several chunk sizes; it runs fastest as one big chunk (at least for me)
                 if_exists='append',
                 verbose=False
                 )
dataframe.to_csv(str(month) + '_file.csv')  # the file is 37.3 MB; this takes almost 2 seconds
# then I manually upload the file through the GCS web UI
print(dataframe.shape)
(363364, 21)
My question is, what is faster?

1. Uploading the DataFrame with the pandas.DataFrame.to_gbq() function
2. Saving the DataFrame as CSV and then uploading it as a file to BigQuery using the Python API
3. Saving the DataFrame as CSV and then uploading the file to Google Cloud Storage using this procedure and then reading it from BigQuery (see the sketch after this list)
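For concreteness, here is a minimal sketch of the BigQuery load step of alternative 3 using the google-cloud-bigquery client, assuming the CSV has already been uploaded to a bucket; the URI and table name (gs://my_bucket/my_file.csv, my_dataset.my_table) are placeholders, not my real names:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the header row written by to_csv()
    autodetect=True,      # let BigQuery infer the schema
)
# Load the CSV straight from Cloud Storage into the destination table
load_job = client.load_table_from_uri(
    'gs://my_bucket/my_file.csv', 'my_dataset.my_table', job_config=job_config)
load_job.result()  # wait for the load job to finish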
Update:
Alternative 1 seems faster than alternative 2 (using pd.DataFrame.to_csv() and load_data_from_file(); alternative 2 takes 17.9 s more on average over 3 runs):
from google.cloud import bigquery

def load_data_from_file(dataset_id, table_id, source_file_name):
    bigquery_client = bigquery.Client()
    dataset_ref = bigquery_client.dataset(dataset_id)
    table_ref = dataset_ref.table(table_id)

    with open(source_file_name, 'rb') as source_file:
        # This example uses CSV, but you can use other formats.
        # See https://cloud.google.com/bigquery/loading-data
        job_config = bigquery.LoadJobConfig()
        job_config.source_format = 'text/csv'
        job_config.autodetect = True
        job = bigquery_client.load_table_from_file(
            source_file, table_ref, job_config=job_config)

    job.result()  # Waits for job to complete

    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_id, table_id))
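For reference, with the CSV written above this is called along these lines (the dataset and table names are the placeholder values from the first snippet):

load_data_from_file('my_dataset', 'my_table', str(month) + '_file.csv')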
Upvotes: 45
Views: 95169
Reputation: 1115
Having also had performance issues with to_gbq(), I just tried the native Google client and it's miles faster (approx. 4x), and if you omit the step where you wait for the result, it's approx. 20x faster.
Worth noting that best practice would be to wait for the result and check it, but in my case there are extra steps later on that validate the results.
I'm using pandas_gbq version 0.15 (the latest at the time of writing). Try this:
from google.cloud import bigquery
import pandas

df = pandas.DataFrame(
    {
        'my_string': ['a', 'b', 'c'],
        'my_int64': [1, 2, 3],
        'my_float64': [4.0, 5.0, 6.0],
        'my_timestamp': [
            pandas.Timestamp("1998-09-04T16:03:14"),
            pandas.Timestamp("2010-09-13T12:03:45"),
            pandas.Timestamp("2015-10-02T16:00:00")
        ],
    }
)

client = bigquery.Client()
table_id = 'my_dataset.new_table'

# Since string columns use the "object" dtype, pass in a (partial) schema
# to ensure the correct BigQuery data type.
job_config = bigquery.LoadJobConfig(schema=[
    bigquery.SchemaField("my_string", "STRING"),
])

job = client.load_table_from_dataframe(
    df, table_id, job_config=job_config
)

# Wait for the load job to complete. (I omit this step)
# job.result()
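If you do want the wait-and-check step that I skip above, a minimal sketch (reusing the client and table_id from the snippet, with a simple row-count check as the validation) would be:

job.result()  # block until the load job completes
table = client.get_table(table_id)
print("Loaded {} rows into {}.".format(table.num_rows, table_id))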
Upvotes: 15
Reputation: 1714
I did the comparison for alternatives 1 and 3 in Datalab using the following code:
from datalab.context import Context
import datalab.storage as storage
import datalab.bigquery as bq
import pandas as pd
from pandas import DataFrame
import time

# Dataframe to write (tuples rather than sets, so the row values keep a defined order)
my_data = [(1, 2, 3)]
for i in range(0, 100000):
    my_data.append((1, 2, 3))
not_so_simple_dataframe = pd.DataFrame(data=my_data, columns=['a', 'b', 'c'])

# Alternative 1
start = time.time()
not_so_simple_dataframe.to_gbq('TestDataSet.TestTable',
                               Context.default().project_id,
                               chunksize=10000,
                               if_exists='append',
                               verbose=False
                               )
end = time.time()
print("time alternative 1 " + str(end - start))

# Alternative 3
start = time.time()

sample_bucket_name = Context.default().project_id + '-datalab-example'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/Hello.txt'
bigquery_dataset_name = 'TestDataSet'
bigquery_table_name = 'TestTable'

# Define storage bucket
sample_bucket = storage.Bucket(sample_bucket_name)

# Define the BigQuery table that insert_data() writes to below
table = bq.Table(bigquery_dataset_name + '.' + bigquery_table_name)

# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_dataframe(not_so_simple_dataframe)
table.create(schema=table_schema, overwrite=True)

# Write the DataFrame to GCS (Google Cloud Storage)
%storage write --variable not_so_simple_dataframe --object $sample_bucket_object

# Write the DataFrame to a BigQuery table
table.insert_data(not_so_simple_dataframe)

end = time.time()
print("time alternative 3 " + str(end - start))
and here are the results for n = {10000,100000,1000000}:
n alternative_1 alternative_3
10000 30.72s 8.14s
100000 162.43s 70.64s
1000000 1473.57s 688.59s
Judging from the results, alternative 3 is faster than alternative 1.
Upvotes: 22