W. Stephens

Reputation: 779

Confusion when uploading a JSON from Google Cloud Storage to BigQuery

Hello, this is a 2-part question.

1) Currently I am trying to upload a file from Google Cloud Storage to BigQuery via a Python script, following the steps given on the Google help site.

https://cloud.google.com/bigquery/docs/loading-data-cloud-storage#bigquery-import-gcs-file-python

def load_data_from_gcs(dataset_name, table_name, source):
    bigquery_client = bigquery.Client()
    dataset = bigquery_client.dataset(dataset_name)
    table = dataset.table(table_name)
    job_name = str(uuid.uuid4())

    job = bigquery_client.load_table_from_storage(
        job_name, table, source)

    job.begin()

    wait_for_job(job)

    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_name, table_name))

I am not sure what to pass in on the first line of "load_data_from_gcs", because in Google Cloud Storage there are no tables; it is a JSON file I am trying to upload. Would the "table" part be the name of the table I am trying to create, or is it referring to the bucket? There is no part that specifies which bucket I want to pull from.

This is the code I have so far.

import json
import argparse
import time
import uuid

from google.cloud import bigquery
# from google.cloud import storage

def load_data_from_gcs(table_name, source):
    # project and dataset are hard-coded; source is meant to be 'pullnupload.json'
    bigquery_client = bigquery.Client('dataworks-356fa')
    dataset = bigquery_client.dataset('FirebaseArchive')
    table = dataset.table(table_name)
    job_name = str(uuid.uuid4())

    job = bigquery_client.load_table_from_storage(
        job_name, table, source)

    job.begin()

    wait_for_job(job)

    print('Loaded {} rows into FirebaseArchive:{}.'.format(
        job.output_rows, table_name))
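For reference, the wait_for_job helper called above is defined in the same docs sample; a minimal version (assuming the pre-1.0 google-cloud-bigquery client used here) looks roughly like this:

def wait_for_job(job):
    # Poll the load job until it finishes, then surface any errors.
    while True:
        job.reload()  # refresh the job state from the API
        if job.state == 'DONE':
            if job.error_result:
                raise RuntimeError(job.errors)
            return
        time.sleep(1)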

Part 2) I want this script to run weekly and either delete the old table and create a new one, or filter in only the non-duplicated data. Whichever is easier.

Thank you for your help.

Upvotes: 0

Views: 345

Answers (1)

Willian Fuks

Reputation: 11787

Not sure what problem you are having, but loading data from a file in GCS into BigQuery works exactly the way you are already doing it.

If you have a table with this schema:

[{"name": "id", "type": "INT64"}, {"name": "name", "type": "STRING"}]

And if you have this file in GCS (located for instance at "gs://bucket/json_data.json"):

{"id": 1, "name": "test1"}
{"id": 2, "name": "test2"}

You'd now just need to set the job object to process a JSON file as input, like so:

def load_data_from_gcs(table_name, source):
    bigquery_client = bigquery.Client('dataworks-356fa')
    dataset = bigquery_client.dataset('FirebaseArchive')
    table = dataset.table(table_name)
    job_name = str(uuid.uuid4())

    # source is the GCS URI of the file, e.g. "gs://bucket/json_data.json"
    job = bigquery_client.load_table_from_storage(
        job_name, table, source)

    # Tell BigQuery the input is newline-delimited JSON (the default is CSV)
    job.source_format = 'NEWLINE_DELIMITED_JSON'
    job.begin()
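Calling it would then look something like this (the table name and GCS URI are just placeholders):

load_data_from_gcs('your_table', 'gs://bucket/json_data.json')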

And that's it.

(If you have a CSV file then you have to set your job object accordingly).
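For instance, a rough sketch for a CSV input, assuming the same pre-1.0 client (these are the usual load-job attributes for CSV):

job.source_format = 'CSV'
job.skip_leading_rows = 1   # skip a header row, if the file has one
job.field_delimiter = ','
job.begin()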

As for the second question, it's really a matter of trying out different approaches and seeing which works best for you.

To delete a table, you'd just need to run:

table.delete()
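So a weekly drop-and-recreate run could look roughly like this (the table name, schema, and GCS URI are placeholders mirroring the example above):

table = dataset.table('your_table')
if table.exists():
    table.delete()

# Recreate the table with the same schema before loading into it again
table = dataset.table('your_table', schema=[
    bigquery.SchemaField('id', 'INTEGER'),
    bigquery.SchemaField('name', 'STRING'),
])
table.create()

load_data_from_gcs('your_table', 'gs://bucket/json_data.json')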

To remove duplicated data from a table, one possibility would be to write a query that removes the duplication and saves the results to the same table. Something like:

query_job = bigquery_client.run_async_query(query=your_query, job_name=job_name)
query_job.destination = table  # the Table object to write the de-duplicated results back to
query_job.write_disposition = 'WRITE_TRUNCATE'
query_job.begin()
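A hypothetical your_query for the example schema above could group on every column to drop exact duplicate rows (your_table is a placeholder):

your_query = """
SELECT id, name
FROM FirebaseArchive.your_table
GROUP BY id, name
"""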

Upvotes: 1
