Fab Dot

Reputation: 564

Can you load JSON formatted data to a BigQuery table using Python and load_table_from_json()?

Issue:

I'm trying to load a JSON object to a BigQuery table using Python 3.7.

From reading the Google documentation, the google-cloud-bigquery module looks like it has a method that should do what I want: load_table_from_json(). However, when I try to implement this method in my Python script, I get the following error in the Python shell:

BadRequest: 400 Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.

When I check the job history in BigQuery, there is also some additional information:

Error while reading data, error message: JSON processing encountered too many errors, giving up. Rows: 1; errors: 1; max bad: 0; error percent: 0

And

Error while reading data, error message: JSON parsing error in row starting at position 0: Value encountered without start of object.

Here's the syntax of the Python script I'm running:

import pandas as pd
import numpy as np
from google.cloud import bigquery
import os

### Converts schema dictionary to BigQuery's expected format for job_config.schema
def format_schema(schema):
    formatted_schema = []
    for row in schema:
        formatted_schema.append(bigquery.SchemaField(row['name'], row['type'], row['mode']))
    return formatted_schema

### Create dummy data to load
df = pd.DataFrame([[2, 'Jane', 'Doe']],
                  columns=['id', 'first_name', 'last_name'])

### Convert dataframe to JSON object
json_data = df.to_json(orient = 'records')

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = r"<My_Credentials_Path>\application_default_credentials.json"

### Define schema as on BigQuery table, i.e. the fields id, first_name and last_name   
table_schema = [
    {
        'name': 'id',
        'type': 'INTEGER',
        'mode': 'REQUIRED'
    },
    {
        'name': 'first_name',
        'type': 'STRING',
        'mode': 'NULLABLE'
    },
    {
        'name': 'last_name',
        'type': 'STRING',
        'mode': 'NULLABLE'
    }
]

project_id = '<my_project>'
dataset_id = '<my_dataset>'
table_id = '<my_table>'

client  = bigquery.Client(project = project_id)
dataset  = client.dataset(dataset_id)
table = dataset.table(table_id)

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.schema = format_schema(table_schema)
job = client.load_table_from_json(json_data, table, job_config = job_config)

print(job.result())

As far as I can tell from the docs, this should work - but it doesn't.

I suspect the issue is with the JSON object json_data and that BigQuery might not like something about that value: [{"id":2,"first_name":"Jane","last_name":"Doe"}]. Even if I pass the lines parameter with a value of True it makes no difference: the resulting string has no square brackets ({"id":2,"first_name":"Jane","last_name":"Doe"}) but still produces the same error.

I also tried using the alternate values for the orient parameter as described in the to_json() documentation, but they all produce the same error as the records value in the above.

I've also tried commenting out the line job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON but got the same error.
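To double-check my suspicion, a quick standalone test (using the same dummy row as above) confirms that to_json() returns a plain Python string, and that parsing it with json.loads() yields a list of dictionaries:

```python
import json

import pandas as pd

# Same dummy row as in the script above
df = pd.DataFrame([[2, 'Jane', 'Doe']],
                  columns=['id', 'first_name', 'last_name'])

json_data = df.to_json(orient='records')
print(type(json_data).__name__)  # str
print(json_data)                 # [{"id":2,"first_name":"Jane","last_name":"Doe"}]

# Parsing the string yields a list of dicts
parsed = json.loads(json_data)
print(type(parsed).__name__)     # list
print(parsed[0]['first_name'])   # Jane
```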

Workaround

Interestingly, in the load_table_from_json() documentation there is a note which states:

If your data is already a newline-delimited JSON string, it is best to wrap it into a file-like object and pass it to load_table_from_file():

import io
from google.cloud import bigquery

data = u'{"foo": "bar"}'
data_as_file = io.StringIO(data)

client = bigquery.Client()
client.load_table_from_file(data_as_file, ...)

If I apply this to my script and attempt to load the data it all works. This indicates that the connection to BigQuery is working correctly and that the issue is indeed with loading the JSON in its original form. There is no mention of deprecation for load_table_from_json() in favour of load_table_from_file() so why isn't it working?

Working code:

For reference, here's a version of the script that uses the load_table_from_file() method to load the data to BigQuery:

import pandas as pd
import numpy as np
from google.cloud import bigquery
import os
import io

def format_schema(schema):
    formatted_schema = []
    for row in schema:
        formatted_schema.append(bigquery.SchemaField(row['name'], row['type'], row['mode']))   
    return formatted_schema

df = pd.DataFrame([[2, 'Jane', 'Doe']],
                  columns=['id', 'first_name', 'last_name'])

### Additional parameter used to convert to newline delimited format
json_data = df.to_json(orient = 'records', lines = True)
stringio_data = io.StringIO(json_data)

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = r"<My_Credentials_Path>\application_default_credentials.json"

table_schema = [
    {
        'name': 'id',
        'type': 'INTEGER',
        'mode': 'REQUIRED'
    },
    {
        'name': 'first_name',
        'type': 'STRING',
        'mode': 'NULLABLE'
    },
    {
        'name': 'last_name',
        'type': 'STRING',
        'mode': 'NULLABLE'
    }
]

project_id = '<my_project>'
dataset_id = '<my_dataset>'
table_id = '<my_table>'

client = bigquery.Client(project = project_id)
dataset = client.dataset(dataset_id)
table = dataset.table(table_id)

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.schema = format_schema(table_schema)

job = client.load_table_from_file(stringio_data, table, job_config = job_config)
print(job.result())

Upvotes: 11

Views: 47701

Answers (1)

rmesteves

Reputation: 4085

The function client.load_table_from_json expects a JSON object (in practice, a list of dictionaries), not a string. To fix it you can do:

import json

After creating your JSON string from pandas, you should do:

json_object = json.loads(json_data)

And in the end you should use your JSON object:

job = client.load_table_from_json(json_object, table, job_config = job_config)

So your code will be like this:

import pandas as pd
import numpy as np
from google.cloud import bigquery
import os, json

### Converts schema dictionary to BigQuery's expected format for job_config.schema
def format_schema(schema):
    formatted_schema = []
    for row in schema:
        formatted_schema.append(bigquery.SchemaField(row['name'], row['type'], row['mode']))
    return formatted_schema

### Create dummy data to load
df = pd.DataFrame([[2, 'Jane', 'Doe']],
                  columns=['id', 'first_name', 'last_name'])

### Convert dataframe to JSON object
json_data = df.to_json(orient = 'records')
json_object = json.loads(json_data)

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = r"<My_Credentials_Path>\application_default_credentials.json"

### Define schema as on BigQuery table, i.e. the fields id, first_name and last_name   
table_schema = [
    {
        'name': 'id',
        'type': 'INTEGER',
        'mode': 'REQUIRED'
    },
    {
        'name': 'first_name',
        'type': 'STRING',
        'mode': 'NULLABLE'
    },
    {
        'name': 'last_name',
        'type': 'STRING',
        'mode': 'NULLABLE'
    }
]

project_id = '<my_project>'
dataset_id = '<my_dataset>'
table_id = '<my_table>'

client  = bigquery.Client(project = project_id)
dataset  = client.dataset(dataset_id)
table = dataset.table(table_id)

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.schema = format_schema(table_schema)
job = client.load_table_from_json(json_object, table, job_config = job_config)

print(job.result())

Please let me know if this helps.
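As a side note, you could also skip the serialise-then-parse round trip entirely: pandas' to_dict(orient='records') returns the list of dictionaries directly, which is the shape load_table_from_json() expects:

```python
import pandas as pd

df = pd.DataFrame([[2, 'Jane', 'Doe']],
                  columns=['id', 'first_name', 'last_name'])

# to_dict(orient='records') yields the list of row dicts directly,
# with no to_json()/json.loads() round trip needed
records = df.to_dict(orient='records')
print(records)

# records can then be passed straight to:
# client.load_table_from_json(records, table, job_config=job_config)
```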

Upvotes: 15
