Reputation: 779
Hello, this is a two-part question.
1) I am currently trying to upload a file from Google Cloud Storage to BigQuery via a Python script. I am trying to follow the steps given on the Google help site:
https://cloud.google.com/bigquery/docs/loading-data-cloud-storage#bigquery-import-gcs-file-python
def load_data_from_gcs(dataset_name, table_name, source):
    bigquery_client = bigquery.Client()
    dataset = bigquery_client.dataset(dataset_name)
    table = dataset.table(table_name)
    job_name = str(uuid.uuid4())

    job = bigquery_client.load_table_from_storage(
        job_name, table, source)

    job.begin()
    wait_for_job(job)

    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_name, table_name))
I am not sure what to pass in the first line of "load_data_from_gcs", because in Google Cloud Storage there are no tables; it is a JSON file I am trying to upload. Is the "table" part the name of the table I am trying to create, or is it referring to the bucket? There is no parameter to specify which bucket I want to pull from.
This is the code I have so far.
import json
import argparse
import time
import uuid

from google.cloud import bigquery
# from google.cloud import storage


def load_data_from_gcs('dataworks-356fa', table_name, 'pullnupload.json'):
    bigquery_client = bigquery.Client('dataworks-356fa')
    dataset = bigquery_client.dataset('FirebaseArchive')
    table = dataset.table(table_name)
    job_name = str(uuid.uuid4())

    job = bigquery_client.load_table_from_storage(
        job_name, table, source)

    job.begin()
    wait_for_job(job)

    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_name, table_name))
2) I want this script to run weekly and either delete the old table and create a new one, or load only the non-duplicated data, whichever is easier.
Thank you for your help.
Upvotes: 0
Views: 345
Reputation: 11787
Not sure what problem you are having, but loading data from a file in GCS to BigQuery works exactly the way you are already doing it.
If you have a table with this schema:
[{"name": "id", "type": "INT64"}, {"name": "name", "type": "STRING"}]
And if you have this file in GCS (located for instance at "gs://bucket/json_data.json"):
{"id": 1, "name": "test1"}
{"id": 2, "name": "test2"}
You would now just need to set the job object to process a JSON file as input, like so:
def load_data_from_gcs(table_name):
    bigquery_client = bigquery.Client('dataworks-356fa')
    dataset = bigquery_client.dataset('FirebaseArchive')
    table = dataset.table(table_name)
    job_name = str(uuid.uuid4())

    job = bigquery_client.load_table_from_storage(
        job_name, table, 'gs://bucket/json_data.json')
    job.source_format = 'NEWLINE_DELIMITED_JSON'

    job.begin()
And that's it.
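One side note: your script also calls a wait_for_job() helper that isn't defined in the code you posted. The sample you linked defines it roughly like this (a sketch that just polls the job until it completes):

def wait_for_job(job):
    while True:
        job.reload()  # refresh the job state from the API
        if job.state == 'DONE':
            if job.error_result:
                raise RuntimeError(job.errors)
            return
        time.sleep(1)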
(If you have a CSV file, then you have to set your job object accordingly.)
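For instance, a minimal sketch of the CSV variant (the file path and the single header row are assumptions made up for illustration):

job = bigquery_client.load_table_from_storage(
    str(uuid.uuid4()), table, 'gs://bucket/csv_data.csv')  # hypothetical CSV file
job.source_format = 'CSV'
job.skip_leading_rows = 1  # assumes the file starts with one header row
job.begin()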
As for the second question, it's really a matter of trying out different approaches and seeing which one works best for you.
To delete a table, you'd just need to run:
table.delete()
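For the weekly refresh in particular, another option is to skip the explicit delete and let the load job itself replace the table's contents. A minimal sketch, assuming the same job setup as in the function above:

job = bigquery_client.load_table_from_storage(
    str(uuid.uuid4()), table, 'gs://bucket/json_data.json')
job.source_format = 'NEWLINE_DELIMITED_JSON'
job.write_disposition = 'WRITE_TRUNCATE'  # overwrite whatever the table held before
job.begin()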
To remove duplicated data from a table, one possibility would be to write a query that removes the duplication and saves the results back to the same table. Something like:
query_job = bigquery_client.run_async_query(query=your_query, job_name=job_name)
query_job.destination = table  # a Table object for the destination table
query_job.write_disposition = 'WRITE_TRUNCATE'
query_job.begin()
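For illustration only (the table reference and the use of standard SQL here are assumptions, not something from your post), your_query could be as simple as a SELECT DISTINCT over the table:

# Hypothetical deduplication query; replace the table reference with your own.
your_query = """
    SELECT DISTINCT *
    FROM `dataworks-356fa.FirebaseArchive.your_table`
"""
# Backtick references are standard SQL; depending on your client library version
# you may need to set query_job.use_legacy_sql = False before calling begin().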
Upvotes: 1