jaz713
jaz713

Reputation: 53

How can I write a query to insert array values from a python dictionary in BigQuery?

I have a python dictionary that looks like this:

{
'id': 123, 
'categories': [
    {'category': 'fruit', 'values': ['apple', 'banana']}, 
    {'category': 'animal', 'values': ['cat']},
    {'category': 'plant', 'values': []}
  ]
}

I am trying to insert those values into a table in big query via the API using python, I just need to format the above into an "INSERT table VALUES" query. The table needs to have the fields: id, categories.category, categories.values.

I need categories to basically be an array with the category and each category's corresponding values. The table is supposed to look sort of like this in the end - except I need it to be just one row per id, with the corresponding category fields nested and having the proper field name:

enter image description here

SELECT 123 as id, (["fruit"], ["apple", "banana"]) as category
UNION ALL (SELECT 123 as id, (["animal"], ["cat"]) as category)
UNION ALL (SELECT 123 as id, (["plant"], ["tree", "bush", "rose"]) as category)

I'm not really sure how to format the "INSERT" query to get the desired result, can anyone help?

Upvotes: 3

Views: 7774

Answers (2)

Alexandre Moraes
Alexandre Moraes

Reputation: 4032

If you want to load a dictionary to BigQuery using Python, you have first to prepare your data. I chose to convert the Python Dictionary to a .json file and then load it to BigQuery using the Python API. However, according to the documentation, BigQuery has some limitations regarding loading .json nested data, among them:

  1. Your .json must be a new line delimited, which means that each object must be in a new line in the file
  2. BigQuery does not support maps or dictionaries in Json. Thus, in order to do so, you have to wrap your whole data in [], as you can see here.

For this reason, some modifications should be done in the file, so you can load the created .json file to BiguQuery. I have created two scripts, in which: the first one converts the Python dict to a JSON file and the second the JSON file is formatted as New Line delimited json and then loaded in BigQuery.

Convert the python dict to a .json file. Notice that you have to wrap the whole data between []:

import json
from google.cloud import bigquery


py_dict =[{
'id': 123, 
'categories': [
    {'category': 'fruit', 'values': ['apple', 'banana']}, 
    {'category': 'animal', 'values': ['cat']},
    {'category': 'plant', 'values': []}
  ]
}]

json_data = json.dumps(py_dict, sort_keys=True)
out_file =  open('json_data.json','w+')
json.dump(py_dict,out_file)

Second, convert json to new line delimited json and load to BigQuery:

import json
from google.cloud import bigquery

with open("json_data.json", "r") as read_file:
    data = json.load(read_file)
result = [json.dumps(record) for record in data]
with open('nd-proceesed.json', 'w') as obj:
    for i in result:
        obj.write(i+'\n')
       

client = bigquery.Client()
filename = '/path/to/file.csv'
dataset_id = 'sample'
table_id = 'json_mytable'

dataset_ref = client.dataset(dataset_id)
table_ref = dataset_ref.table(table_id)
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.autodetect = True

with open("nd-proceesed.json", "rb") as source_file:
    job = client.load_table_from_file(source_file, table_ref, job_config=job_config)

job.result()  # Waits for table load to complete.

print("Loaded {} rows into {}:{}.".format(job.output_rows, dataset_id, table_id))

Then, in the BigQuery UI, you can query your table as follows:

SELECT id, categories
FROM `test-proj-261014.sample.json_mytable4` , unnest(categories) as categories

And the output:

enter image description here

Upvotes: 11

Mikhail Berlyant
Mikhail Berlyant

Reputation: 172944

You can use below query - with your dictionary text embed into it

#standardSQL
WITH data AS (
SELECT '''
  {
  'id': 123, 
  'categories': [
      {'category': 'fruit', 'values': ['apple', 'banana']}, 
      {'category': 'animal', 'values': ['cat']},
      {'category': 'plant', 'values': ['tree', 'bush', 'rose']}
    ]
  } 
  ''' dict
)
SELECT 
  JSON_EXTRACT_SCALAR(dict, '$.id') AS id,
  ARRAY(
    SELECT AS STRUCT 
      JSON_EXTRACT_SCALAR(cat, '$.category') AS category,
      ARRAY(
        SELECT TRIM(val, '"')
        FROM UNNEST(JSON_EXTRACT_ARRAY(cat, '$.values')) val
      )`values`
    FROM UNNEST(JSON_EXTRACT_ARRAY(dict, '$.categories')) cat
  ) AS categories 
FROM data   

which produces below result

Row id  categories.category categories.values    
1   123 fruit               apple    
                            banana   
        animal              cat  
        plant               tree     
                            bush     
                            rose     

Upvotes: 1

Related Questions