Reputation: 564
I'm trying to load a JSON object to a BigQuery table using Python 3.7. From reading the Google documentation, the google-cloud-bigquery module looks like it has a method that should do what I want: load_table_from_json(). However, when I try to use this method in my Python script, I get the following error in the Python shell:
BadRequest: 400 Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 1; errors: 1. Please look into the errors[] collection for more details.
When I check the job history in BigQuery, there is also some additional information:
Error while reading data, error message: JSON processing encountered too many errors, giving up. Rows: 1; errors: 1; max bad: 0; error percent: 0
And
Error while reading data, error message: JSON parsing error in row starting at position 0: Value encountered without start of object.
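As a side note, the errors[] collection mentioned in the first message can be read from the failed job object in Python; a minimal sketch, assuming a job handle like the one created in the script below:

from google.api_core.exceptions import BadRequest

try:
    job.result()
except BadRequest:
    # job.errors holds the row-level details; each entry is a dict
    # with keys such as 'reason', 'location' and 'message'
    for error in job.errors:
        print(error['message'])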
Here's the Python script I'm running:
import pandas as pd
import numpy as np
from google.cloud import bigquery
import os
### Converts schema dictionary to BigQuery's expected format for job_config.schema
def format_schema(schema):
    formatted_schema = []
    for row in schema:
        formatted_schema.append(bigquery.SchemaField(row['name'], row['type'], row['mode']))
    return formatted_schema
### Create dummy data to load
df = pd.DataFrame([[2, 'Jane', 'Doe']],
                  columns=['id', 'first_name', 'last_name'])
### Convert dataframe to JSON object
json_data = df.to_json(orient = 'records')
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = r"<My_Credentials_Path>\application_default_credentials.json"
### Define schema as on BigQuery table, i.e. the fields id, first_name and last_name
table_schema = {
    'name': 'id',
    'type': 'INTEGER',
    'mode': 'REQUIRED'
}, {
    'name': 'first_name',
    'type': 'STRING',
    'mode': 'NULLABLE'
}, {
    'name': 'last_name',
    'type': 'STRING',
    'mode': 'NULLABLE'
}
project_id = '<my_project>'
dataset_id = '<my_dataset>'
table_id = '<my_table>'
client = bigquery.Client(project = project_id)
dataset = client.dataset(dataset_id)
table = dataset.table(table_id)
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.schema = format_schema(table_schema)
job = client.load_table_from_json(json_data, table, job_config = job_config)
print(job.result())
As far as I can tell from the docs, this should work, but it doesn't. I suspect the issue is with the JSON object json_data, and that BigQuery might not like something about that value: [{"id":2,"first_name":"Jane","last_name":"Doe"}]. Even if I pass the lines parameter with a value of True it makes no difference, even though that version of the JSON string has no square brackets: {"id":2,"first_name":"Jane","last_name":"Doe"}
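To make the two outputs concrete:

df.to_json(orient = 'records')                # '[{"id":2,"first_name":"Jane","last_name":"Doe"}]'
df.to_json(orient = 'records', lines = True)  # '{"id":2,"first_name":"Jane","last_name":"Doe"}'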
I also tried the alternative values for the orient parameter described in the to_json() documentation, but they all produce the same error as the records value above.
I've also tried commenting out the line job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON, but got the same error.
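For reference, newline-delimited JSON puts one object per line, with no enclosing array and no commas between objects, e.g.:

{"id":1,"first_name":"John","last_name":"Smith"}
{"id":2,"first_name":"Jane","last_name":"Doe"}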
Interestingly, the load_table_from_json() documentation contains a note which states:
If your data is already a newline-delimited JSON string, it is best to wrap it into a file-like object and pass it to load_table_from_file():
import io
from google.cloud import bigquery
data = u'{"foo": "bar"}'
data_as_file = io.StringIO(data)
client = bigquery.Client()
client.load_table_from_file(data_as_file, ...)
If I apply this to my script and attempt to load the data, it all works. This indicates that the connection to BigQuery is working correctly and that the issue is indeed with loading the JSON in its original form. There is no mention of load_table_from_json() being deprecated in favour of load_table_from_file(), so why isn't it working?
For reference, here's a version of the script that uses the load_table_from_file() method to load the data to BigQuery:
import pandas as pd
import numpy as np
from google.cloud import bigquery
import os
import io
def format_schema(schema):
    formatted_schema = []
    for row in schema:
        formatted_schema.append(bigquery.SchemaField(row['name'], row['type'], row['mode']))
    return formatted_schema
df = pd.DataFrame([[2, 'Jane', 'Doe']],
                  columns=['id', 'first_name', 'last_name'])
### Additional parameter used to convert to newline delimited format
json_data = df.to_json(orient = 'records', lines = True)
stringio_data = io.StringIO(json_data)
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = r"<My_Credentials_Path>\application_default_credentials.json"
table_schema = {
    'name': 'id',
    'type': 'INTEGER',
    'mode': 'REQUIRED'
}, {
    'name': 'first_name',
    'type': 'STRING',
    'mode': 'NULLABLE'
}, {
    'name': 'last_name',
    'type': 'STRING',
    'mode': 'NULLABLE'
}
project_id = '<my_project>'
dataset_id = '<my_dataset>'
table_id = '<my_table>'
client = bigquery.Client(project = project_id)
dataset = client.dataset(dataset_id)
table = dataset.table(table_id)
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.schema = format_schema(table_schema)
job = client.load_table_from_file(stringio_data, table, job_config = job_config)
print(job.result())
Upvotes: 11
Views: 47701
Reputation: 4085
The function client.load_table_from_json() expects parsed JSON, i.e. a list of row dicts, rather than a JSON string. To fix it you can do:
import json
After creating your JSON string with pandas, you should do:
json_object = json.loads(json_data)
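json.loads() turns the JSON string into a Python list of dicts, which is the row format load_table_from_json() expects:

json_object = json.loads('[{"id":2,"first_name":"Jane","last_name":"Doe"}]')
print(json_object)   # [{'id': 2, 'first_name': 'Jane', 'last_name': 'Doe'}]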
And finally, pass the JSON object to load_table_from_json():
job = client.load_table_from_json(json_object, table, job_config = job_config)
So your code will be like this:
import pandas as pd
import numpy as np
from google.cloud import bigquery
import os, json
### Converts schema dictionary to BigQuery's expected format for job_config.schema
def format_schema(schema):
    formatted_schema = []
    for row in schema:
        formatted_schema.append(bigquery.SchemaField(row['name'], row['type'], row['mode']))
    return formatted_schema
### Create dummy data to load
df = pd.DataFrame([[2, 'Jane', 'Doe']],
                  columns=['id', 'first_name', 'last_name'])
### Convert dataframe to JSON object
json_data = df.to_json(orient = 'records')
json_object = json.loads(json_data)
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = r"<My_Credentials_Path>\application_default_credentials.json"
### Define schema as on BigQuery table, i.e. the fields id, first_name and last_name
table_schema = {
    'name': 'id',
    'type': 'INTEGER',
    'mode': 'REQUIRED'
}, {
    'name': 'first_name',
    'type': 'STRING',
    'mode': 'NULLABLE'
}, {
    'name': 'last_name',
    'type': 'STRING',
    'mode': 'NULLABLE'
}
project_id = '<my_project>'
dataset_id = '<my_dataset>'
table_id = '<my_table>'
client = bigquery.Client(project = project_id)
dataset = client.dataset(dataset_id)
table = dataset.table(table_id)
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.schema = format_schema(table_schema)
job = client.load_table_from_json(json_object, table, job_config = job_config)
print(job.result())
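As a side note, you could also skip the JSON round trip and build the list of row dicts straight from the dataframe; a sketch, with the caveat that depending on your pandas version the values may come back as numpy scalar types that need casting to native Python types first:

### Alternative: build the row dicts directly, avoiding to_json()/json.loads()
records = df.to_dict(orient = 'records')
job = client.load_table_from_json(records, table, job_config = job_config)
print(job.result())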
Please let me know if this helps.
Upvotes: 15