Reputation: 11
I'm using the following script to select from a BigQuery table, then loop over the query job's result rows and write them to a CSV file one row at a time.
My problem is that for some query jobs the total number of rows written matches the expected count, but some rows are duplicated and others are missing. For example, the query itself may return 25k rows with no duplicates, yet of the 25k rows written, 15 records are duplicated and 15 records are missing when compared with the query result.
One file in question is over 3 GB before compressing, but for other queries (selecting the same columns over a different visitStartTime range) the file is even larger yet has no duplication issue. Is there a reason why, for some query jobs, some records would be written in duplicate while others are not written at all?
from google.cloud import bigquery
from google.oauth2 import service_account
import csv

query_string = """select c1,c2,c3,c4,c5
from `mydb.mydataset.mytable_20211212`
where visitStartTime >= 1639350000.0 AND
visitStartTime < 1639353600.0"""

credentials = service_account.Credentials.from_service_account_file(
    key_path,
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)
query_job = client.query(query_string)

with open('myfilename.csv', 'w', newline='', encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile, quoting=csv.QUOTE_NONE, delimiter='|',
                        quotechar='', escapechar='\\')
    # write header row
    writer.writerow(["c1", "c2", "c3", "c4", "c5"])
    # write data rows
    for row in query_job:
        writer.writerow([row.c1, row.c2, row.c3, row.c4, row.c5])
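For reference, the mismatch can be confirmed with something along these lines (a rough sketch reusing the client and query_job from above, not exactly what I ran):

# Rough cross-check (sketch): compare the row count BigQuery reports for the
# job against the number of data rows that actually ended up in the CSV.
result = query_job.result()
print("rows reported by BigQuery:", result.total_rows)

with open('myfilename.csv', newline='', encoding="utf-8") as csvfile:
    reader = csv.reader(csvfile, delimiter='|', escapechar='\\',
                        quoting=csv.QUOTE_NONE)
    rows_written = sum(1 for _ in reader) - 1  # subtract the header row
print("rows written to the file:", rows_written)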
Upvotes: 0
Views: 790
Reputation: 150
From what I understand, the number of duplicated records matches the number of missing ones (15 and 15), and the query itself returns no duplicates. If the CSV differs from the query result even though it is written directly from that result, then the problem is most likely in the CSV writing. I suggest you look at the exporting data documentation and the examples given there. Given the size you're working with, I would start by extracting the table, compressed, to a Cloud Storage bucket:
from google.cloud import bigquery

client = bigquery.Client()
bucket_name = '[BUCKET_NAME]'
destination_uri = "gs://{}/{}".format(bucket_name, "[TABLE_NAME].csv.gz")

# project and dataset_id identify the project/dataset that contain [TABLE_NAME]
dataset_ref = bigquery.DatasetReference(project, dataset_id)
table_ref = dataset_ref.table("[TABLE_NAME]")

job_config = bigquery.job.ExtractJobConfig()
job_config.compression = bigquery.Compression.GZIP

extract_job = client.extract_table(
    table_ref,
    destination_uri,
    location="[LOCATION]",
    job_config=job_config,
)
extract_job.result()  # wait for the extract job to complete
And if this doesn't work, I would try exporting the data into multiple files, for example:
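Just as a sketch based on the snippet above (same placeholders), putting a wildcard in the destination URI lets BigQuery shard the export into multiple compressed files:

# Same extract job as above, but with a wildcard in the object name so the
# export is split into multiple compressed files.
destination_uri = "gs://{}/{}".format(bucket_name, "[TABLE_NAME]-*.csv.gz")

extract_job = client.extract_table(
    table_ref,
    destination_uri,
    location="[LOCATION]",
    job_config=job_config,
)
extract_job.result()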
If any of this works as you want, then the issue is most probably in how the query_string is written. Have you considered using DISTINCT, or setting up more filters?
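Just as a sketch, with the column and table names taken from your question, that could look like:

# Hypothetical variant of the query with DISTINCT, in case the deduplication
# needs to happen in the query itself rather than in the file writing.
query_string = """SELECT DISTINCT c1, c2, c3, c4, c5
FROM `mydb.mydataset.mytable_20211212`
WHERE visitStartTime >= 1639350000.0
  AND visitStartTime < 1639353600.0"""
query_job = client.query(query_string)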
The problem may also be in the data being exported. Have you tried, once you see which rows come out wrong, narrowing the visitStartTime filter so that only that "wrong data" is exported, to check how those rows behave on their own?
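Something along these lines, where the narrower timestamps are only hypothetical values inside your original range:

# Hypothetical narrower visitStartTime window around some of the "wrong" rows,
# so only that slice is queried and can be compared against the full CSV.
debug_query = """SELECT c1, c2, c3, c4, c5
FROM `mydb.mydataset.mytable_20211212`
WHERE visitStartTime >= 1639351000.0
  AND visitStartTime < 1639351100.0"""
debug_rows = list(client.query(debug_query).result())
print("rows in the suspect slice:", len(debug_rows))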
Upvotes: 0