Pete M

Reputation: 194

BigQuery Client Load Table from URI in Python doesn't work but using Pandas does

I have an ETL process that runs as a Cloud Function, reading a CSV from a Cloud Storage bucket and writing it to a BigQuery table.

import logging

from google.cloud import bigquery

logger = logging.getLogger(__name__)


def gcs_to_table(uri, table_id, project_id):

    logger.info(f'Writing data from {uri} to {table_id}.')

    # Construct a BigQuery client object.
    client = bigquery.Client(project=project_id)

    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        allow_jagged_rows=True,
        max_bad_records=100,
    )

    load_job = client.load_table_from_uri(
        uri, table_id, job_config=job_config
    )

    load_job.result()  # Waits for the job to complete.

    destination_table = client.get_table(table_id)
    logger.info(f"Loaded {destination_table.num_rows} rows to table {table_id}")

This solution has been failing because the load job exceeds the 100-bad-record limit set by max_bad_records, usually with an error like this one:

message: Error while reading data, error message: Unable to parse; byte_offset_to_start_of_line: 506943943 column_index: 11 column_name: "SomeColumn" column_type: DOUBLE value: "SomeString"
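For reference, since the error reports the byte offset of the failing line, the surrounding raw bytes can be pulled straight from the bucket to inspect what BigQuery actually saw. A minimal sketch using google-cloud-storage (the bucket and object names below are placeholders; the offset is the one from the error above):

    from google.cloud import storage

    def dump_bytes_around(bucket_name, blob_name, offset, window=2048):
        """Print the raw bytes surrounding an offset reported by a load job error."""
        client = storage.Client()
        blob = client.bucket(bucket_name).blob(blob_name)
        # download_as_bytes accepts a byte range, so only the window is fetched.
        start = max(0, offset - window)
        chunk = blob.download_as_bytes(start=start, end=offset + window)
        print(chunk.decode('utf-8', errors='replace'))

    # Placeholder names; substitute the actual bucket and object.
    dump_bytes_around('my-bucket', 'path/to/file.csv', 506943943)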

"SomeString" belongs in the previous column, so the columns are somehow getting misaligned during the load job. However, using notepad++, I've looked at the raw file and checked for obvious formatting issues with the csv such as mismatched quote characters or commas inside of string values, and I don't see anything that could be causing it. I suspect this has something to do with my configuration and not the format of the CSV since when I run the process using pandas and pandas_gbq on that same CSV it works just fine:

import pandas as pd


def gcs_to_table(gcs_object, table_name, project_id):

    logger.info(f'Writing data from {gcs_object} to {table_name}.')

    # pandas can read gs:// paths directly when gcsfs is installed.
    dat = pd.read_csv(gcs_object)

    # to_gbq uses pandas_gbq under the hood.
    dat.to_gbq(table_name, project_id=project_id, if_exists='replace')
    logger.info(f'Wrote DataFrame with shape: {dat.shape}.')

The pandas solution is much slower and can run into timeout issues, so I would prefer to keep using the BigQuery client if I can get these issues resolved.
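One guess on my part (not confirmed against the file): pandas' parser accepts newlines inside quoted fields by default, while BigQuery's CSV loader rejects them unless allow_quoted_newlines is set, and an embedded newline would split a row and shift values into the wrong columns in exactly this way. Here is the same job config with the quoting options made explicit:

    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        allow_jagged_rows=True,
        max_bad_records=100,
        # Guess: the file may contain newlines inside quoted fields, which
        # pandas tolerates by default but BigQuery rejects unless this is set.
        allow_quoted_newlines=True,
        # '"' is already the default; stated here only to be explicit.
        quote_character='"',
    )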

Any wisdom or even troubleshooting steps are greatly appreciated!

Upvotes: 0

Views: 46

Answers (0)
