fragilewindows
fragilewindows

Reputation: 1412

Check if data already exists before inserting into BigQuery table (using Python)

I am setting up a daily cron job that appends a row to BigQuery table (using Python), however, duplicate data is being inserted. I have searched online and I know that there is a way to manually remove duplicate data, but I wanted to see if I could avoid this duplication in the first place.

Is there a way to check a BigQuery table to see if a data record already exists first in order to avoid inserting duplicate data? Thanks.

CODE SNIPPET:

import webapp2
import logging
from googleapiclient import discovery
from oath2client.client import GoogleCredentials

PROJECT_ID = 'foo'
DATASET_ID = 'bar'
TABLE_ID = 'foo_bar_table’

class UpdateTableHandler(webapp2.RequestHandler):
    def get(self):
        credentials = GoogleCredentials.get_application_default()
        service = discovery.build('bigquery', 'v2', credentials=credentials)

    try:

     the_fruits = Stuff.query(Stuff.fruitTotal >= 5).filter(Stuff.fruitColor == 'orange').fetch();

     for fruit in the_fruits:
       #some code here

     basket = dict()
     basket['id'] = fruit.fruitId
     basket['Total'] = fruit.fruitTotal
     basket['PrimaryVitamin'] = fruit.fruitVitamin
     basket['SafeRaw'] = fruit.fruitEdibleRaw
     basket['Color'] = fruit.fruitColor
     basket['Country'] = fruit.fruitCountry

            body = {
                'rows': [
                    {
                        'json': basket,
                        'insertId': str(uuid.uuid4())
                    }
                ]
            }

            response = bigquery_service.tabledata().insertAll(projectId=PROJECT_ID,
                                                              datasetId=DATASET_ID,
                                                              tableId=TABLE_ID,
                                                              body=body).execute(num_retries=5)
            logging.info(response)

    except Exception, e:
        logging.error(e)

app = webapp2.WSGIApplication([
    ('/update_table', UpdateTableHandler),
], debug=True)

Upvotes: 5

Views: 5236

Answers (1)

Jeremy Condit
Jeremy Condit

Reputation: 7046

The only way to test whether the data already exists is to run a query.

If you have lots of data in the table, that query could be expensive, so in most cases we suggest you go ahead and insert the duplicate, and then merge duplicates later on.

As Zig Mandel suggests in a comment, you can query over a date partition if you know the date when you expect to see the record, but that may still be expensive compared to inserting and removing duplicates.

Upvotes: 2

Related Questions