Reputation: 83
I am working on downloading some data from BigQuery. Since I am not the owner and don't have permission to create objects in BigQuery, I used a standard SQL query to pull the 21 million rows of records. The performance is really bad (Windows + Anaconda): it takes at least an hour to download 1 million rows, even with pyarrow installed, and there is no progress indication at all.
I may also need to do some ETL and write the result back to a database table. I am not sure whether performance would improve if I downloaded the target table(s) directly into a database instead.
Could anyone give me some clues? Thanks in advance. Regards, AM
Here is my code:
from google.cloud import bigquery
from google.oauth2 import service_account

# Authenticate with a service-account key file
credentials = service_account.Credentials.from_service_account_file('D:/secure/myprojid-123a1285876z.json')
project_id = 'myprojid'
client = bigquery.Client(credentials=credentials, project=project_id)

# Run a standard SQL query against the source table
query_job = client.query("""
    SELECT *
    FROM big_query_schema.bg_table_name_01
""")

results = query_job.result()  # Waits for the job to complete.
df = results.to_dataframe()   # Pulls the full result set into memory.
df.to_csv('D:/temp/out.txt', index=False, sep='|')
Upvotes: 1
Views: 106
Reputation: 4166
The problem is probably that you are trying to load the entire dataframe into memory before writing it out.
Two solutions:

1. Stream the query result and append it to the CSV in chunks instead of materialising one huge dataframe (see the first sketch below).
2. Skip the client-side download entirely: run an extract job that exports the table to Google Cloud Storage as CSV and download the exported files from there (second sketch below).
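A minimal sketch of the first approach, reusing the key file, table name, and output path from the question: to_dataframe_iterable() yields one pandas DataFrame per result page, so each page is written out as it arrives and memory use stays at roughly one page at a time.

from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file('D:/secure/myprojid-123a1285876z.json')
client = bigquery.Client(credentials=credentials, project='myprojid')

query_job = client.query("""
    SELECT *
    FROM big_query_schema.bg_table_name_01
""")

# Write the CSV page by page instead of building one 21-million-row dataframe.
first_page = True
with open('D:/temp/out.txt', 'w', newline='') as f:
    for page_df in query_job.result().to_dataframe_iterable():
        page_df.to_csv(f, index=False, sep='|', header=first_page)
        first_page = False

If download speed (rather than memory) is also a bottleneck, installing the google-cloud-bigquery-storage package so the client can fetch result pages over the BigQuery Storage API usually helps as well.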
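And a sketch of the second approach, assuming you have (or can get) write access to a Cloud Storage bucket; 'my-export-bucket' below is a placeholder, while the project, dataset, and table names are taken from the question. The extract job runs server-side, so nothing has to pass through pandas, and since your query is a plain SELECT * the exported table is equivalent to the query result.

from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file('D:/secure/myprojid-123a1285876z.json')
client = bigquery.Client(credentials=credentials, project='myprojid')

# 'my-export-bucket' is a placeholder; the wildcard lets BigQuery split
# a large table across multiple exported files.
destination_uri = 'gs://my-export-bucket/bg_table_name_01-*.csv'
table_ref = bigquery.DatasetReference('myprojid', 'big_query_schema').table('bg_table_name_01')

job_config = bigquery.ExtractJobConfig()
job_config.field_delimiter = '|'  # keep the pipe-delimited output

extract_job = client.extract_table(table_ref, destination_uri, job_config=job_config)
extract_job.result()  # Wait for the export to finish.

# Then download the exported shards, e.g.:
#   gsutil -m cp "gs://my-export-bucket/bg_table_name_01-*.csv" D:/temp/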
Upvotes: 1