Eli

Reputation: 38899

Loading a Lot of Data into Google BigQuery from Python

I've been struggling to load big chunks of data into BigQuery for a little while now. In Google's docs, I see the insertAll method, which seems to work fine but gives me 413 "Entity too large" errors when I try to send anything over about 100k of data in JSON. Per Google's docs, I should be able to send up to 1TB of uncompressed data in JSON. What gives? The example on the previous page has me building the request body manually instead of using insertAll, which is uglier and more error-prone. I'm also not sure what format the data should be in, in that case.

So, all of that said, what is the clean/proper way of loading lots of data into BigQuery? An example with data would be great. If at all possible, I'd really rather not build the request body myself.

Upvotes: 1

Views: 3961

Answers (2)

Jordan Tigani

Reputation: 26617

The example here uses the resumable upload to upload a CSV file. While the file used is small, it should work for virtually any size of upload, since it uses a robust media upload protocol. It sounds like you want JSON, which means you'd need to tweak the code slightly (an example for JSON is in the load_json.py example in the same directory). If you have a stream you want to upload instead of a file, you can use a MediaInMemoryUpload instead of the MediaFileUpload that is used in the example.
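For reference, here is a minimal sketch of driving a resumable upload chunk by chunk, which is useful if you want progress reporting on a large file. It is not the linked sample: `run_resumable` and `on_progress` are hypothetical names, and the request is assumed to be the object returned by `bigquery.jobs().insert(...)` in google-api-python-client when `media_body` is a resumable upload.

```python
def run_resumable(request, on_progress=None):
    """Send a resumable upload one chunk at a time until the API responds.

    `request` is assumed to expose next_chunk(), as HttpRequest objects from
    google-api-python-client do when built with a resumable media_body.
    """
    response = None
    while response is None:
        # next_chunk() returns (status, None) while the upload is in
        # progress, and (status_or_None, response) once it has finished.
        status, response = request.next_chunk()
        if status is not None and on_progress is not None:
            on_progress(status.progress())  # fraction uploaded, 0.0 to 1.0
    return response


# Possible usage (needs google-api-python-client and an authorized service;
# "data.csv" and job_body are placeholders):
#
# from googleapiclient.http import MediaFileUpload
# media = MediaFileUpload("data.csv",
#                         mimetype="application/octet-stream",
#                         resumable=True)
# request = bigquery.jobs().insert(projectId=project_id,
#                                  body=job_body,
#                                  media_body=media)
# job = run_resumable(request, on_progress=print)
```

Calling `request.execute()` instead should also complete a resumable upload; the explicit loop just gives you a hook between chunks.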

BTW ... Craig's answer is correct, I just thought I'd chime in with links to sample code.

Upvotes: 1

Craig Citro

Reputation: 6625

Note that for streaming data to BQ, anything above 10k rows/sec requires talking to a sales rep.

If you'd like to send large chunks directly to BQ, you can send them via POST. If you're using a client library, it should handle making the upload resumable for you. To do this, you'll need to make a call to jobs.insert() instead of tabledata.insertAll(), and provide a description of a load job. To actually push the bytes using the Python client, you can create a MediaFileUpload or MediaInMemoryUpload and pass it as the media_body parameter.
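A minimal sketch of that approach, with some assumptions spelled out: the data is sent as newline-delimited JSON, the destination table already exists (so no schema is supplied), and `bigquery` is an authorized service object built with google-api-python-client. The helper names (`rows_to_ndjson`, `build_load_job`, `upload_rows`) are made up for illustration.

```python
import json


def rows_to_ndjson(rows):
    """Serialize an iterable of dicts as newline-delimited JSON bytes."""
    return "\n".join(json.dumps(row) for row in rows).encode("utf-8")


def build_load_job(project_id, dataset_id, table_id):
    """Describe a load job that appends JSON rows to an existing table."""
    return {
        "configuration": {
            "load": {
                "sourceFormat": "NEWLINE_DELIMITED_JSON",
                "destinationTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": table_id,
                },
                # Assumption: append to a table that already has a schema.
                "writeDisposition": "WRITE_APPEND",
            }
        }
    }


def upload_rows(bigquery, project_id, dataset_id, table_id, rows):
    """Start a load job, pushing the rows as the media body of jobs.insert()."""
    # Imported here so the two helpers above work without the client library.
    from googleapiclient.http import MediaInMemoryUpload

    media = MediaInMemoryUpload(
        rows_to_ndjson(rows),
        mimetype="application/octet-stream",
        resumable=True,
    )
    return bigquery.jobs().insert(
        projectId=project_id,
        body=build_load_job(project_id, dataset_id, table_id),
        media_body=media,
    ).execute()
```

The job description is an ordinary dict, so the only thing built by hand is the configuration; the client library takes care of the multipart or resumable request itself.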

The other option is to stage the data in Google Cloud Storage and load it from there.
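A rough sketch of the Cloud Storage route, assuming the files have already been copied to a bucket and that `bigquery` is an authorized google-api-python-client service object. The function names and the `gs://` URI in the usage comment are placeholders, and the existing-table and JSON-format choices are assumptions rather than requirements.

```python
import time


def build_gcs_load_job(project_id, dataset_id, table_id, source_uris):
    """Describe a load job that reads newline-delimited JSON from GCS."""
    return {
        "configuration": {
            "load": {
                "sourceUris": list(source_uris),
                "sourceFormat": "NEWLINE_DELIMITED_JSON",
                "destinationTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": table_id,
                },
                # Assumption: the table exists, so no schema is given here.
                "writeDisposition": "WRITE_APPEND",
            }
        }
    }


def submit_load_job(bigquery, project_id, job_body):
    """Insert the job; no media body is needed because the data is in GCS."""
    return bigquery.jobs().insert(projectId=project_id, body=job_body).execute()


def wait_for_job(bigquery, project_id, job, poll_seconds=2.0):
    """Poll jobs.get() until the job is DONE, raising if it reports an error."""
    job_id = job["jobReference"]["jobId"]
    while job["status"]["state"] != "DONE":
        time.sleep(poll_seconds)
        job = bigquery.jobs().get(projectId=project_id, jobId=job_id).execute()
    error = job["status"].get("errorResult")
    if error:
        raise RuntimeError(error.get("message", "load job failed"))
    return job


# Possible usage (placeholder names):
# body = build_gcs_load_job("my-project", "my_dataset", "my_table",
#                           ["gs://my-bucket/rows.json"])
# job = wait_for_job(bigquery, "my-project",
#                    submit_load_job(bigquery, "my-project", body))
```

Load jobs run asynchronously, which is why the sketch polls for completion instead of treating the insert response as the final result.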

Upvotes: 5
