Reputation: 11
I'm using the following script to select from a BigQuery table, then loop over the query job's result rows and write them to a CSV file one row at a time.
My problem is that for some query jobs the total number of rows written matches the expected count, but some rows are duplicated and others are missing. For example, the query itself may return 25k rows with no duplicates, yet of the 25k rows written, 15 records are duplicated and 15 records are missing when compared with the query result.
One file in question is over 3 GB before compressing, but for other queries (selecting the same columns over a different visitStartTime range) the file is even larger yet has no duplication issue. Is there a reason why, for some query jobs, some records would be written in duplicate while others are not written at all?
from google.cloud import bigquery
from google.oauth2 import service_account
import csv

query_string = """select c1,c2,c3,c4,c5
from `mydb.mydataset.mytable_20211212`
where visitStartTime >= 1639350000.0 AND
visitStartTime < 1639353600.0"""

credentials = service_account.Credentials.from_service_account_file(
    key_path,
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)
query_job = client.query(query_string)

with open('myfilename.csv', 'w', newline='', encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile, quoting=csv.QUOTE_NONE, delimiter='|',
                        quotechar='', escapechar='\\')
    # write header row
    writer.writerow(["c1", "c2", "c3", "c4", "c5"])
    # write data rows
    for row in query_job:
        writer.writerow([row.c1, row.c2, row.c3, row.c4, row.c5])
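For reference, the mismatch can be confirmed with something along these lines (a rough sketch reusing the client and query_job from above, not exactly what I ran):

# Rough cross-check (sketch): compare the row count BigQuery reports for the
# job against the number of data rows that actually ended up in the CSV.
result = query_job.result()
print("rows reported by BigQuery:", result.total_rows)

with open('myfilename.csv', newline='', encoding="utf-8") as csvfile:
    reader = csv.reader(csvfile, delimiter='|', escapechar='\\',
                        quoting=csv.QUOTE_NONE)
    rows_written = sum(1 for _ in reader) - 1  # subtract the header row
print("rows written to the file:", rows_written)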
Upvotes: 0
Views: 790
Reputation: 150
From what I understand, the number of duplicated records matches the number of missing ones (15 and 15), and the query itself returns no duplicates. If the CSV differs from the query result even though it is written directly from that result, then the problem is most likely in the CSV writing. I suggest you look at the exporting data documentation and the examples given there. Given the size you're working with, I would start by extracting the table, compressed, to a Cloud Storage bucket:
from google.cloud import bigquery

client = bigquery.Client()
bucket_name = '[BUCKET_NAME]'
destination_uri = "gs://{}/{}".format(bucket_name, "[TABLE_NAME].csv.gz")

# project and dataset_id identify the project/dataset that contain [TABLE_NAME]
dataset_ref = bigquery.DatasetReference(project, dataset_id)
table_ref = dataset_ref.table("[TABLE_NAME]")

job_config = bigquery.job.ExtractJobConfig()
job_config.compression = bigquery.Compression.GZIP

extract_job = client.extract_table(
    table_ref,
    destination_uri,
    location="[LOCATION]",
    job_config=job_config,
)
extract_job.result()  # wait for the extract job to complete
And if this doesn't work, I would try exporting the data into multiple files, for example:
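Just as a sketch based on the snippet above (same placeholders), putting a wildcard in the destination URI lets BigQuery shard the export into multiple compressed files:

# Same extract job as above, but with a wildcard in the object name so the
# export is split into multiple compressed files.
destination_uri = "gs://{}/{}".format(bucket_name, "[TABLE_NAME]-*.csv.gz")

extract_job = client.extract_table(
    table_ref,
    destination_uri,
    location="[LOCATION]",
    job_config=job_config,
)
extract_job.result()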
If any of this works as you want, then the issue is most probably in how the query_string is written. Have you considered using DISTINCT, or setting up more filters?
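Just as a sketch, with the column and table names taken from your question, that could look like:

# Hypothetical variant of the query with DISTINCT, in case the deduplication
# needs to happen in the query itself rather than in the file writing.
query_string = """SELECT DISTINCT c1, c2, c3, c4, c5
FROM `mydb.mydataset.mytable_20211212`
WHERE visitStartTime >= 1639350000.0
  AND visitStartTime < 1639353600.0"""
query_job = client.query(query_string)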
The problem may also be in the data being exported. Have you tried, once you see which rows come out wrong, narrowing the visitStartTime filter so that only that "wrong data" is exported, to check how those rows behave on their own?
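Something along these lines, where the narrower timestamps are only hypothetical values inside your original range:

# Hypothetical narrower visitStartTime window around some of the "wrong" rows,
# so only that slice is queried and can be compared against the full CSV.
debug_query = """SELECT c1, c2, c3, c4, c5
FROM `mydb.mydataset.mytable_20211212`
WHERE visitStartTime >= 1639351000.0
  AND visitStartTime < 1639351100.0"""
debug_rows = list(client.query(debug_query).result())
print("rows in the suspect slice:", len(debug_rows))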
Upvotes: 0