David542

Reputation: 110452

Get a massive csv file from GCS to BQ

I have a very large CSV file (let's say 1TB) that I need to get from GCS into BQ. While BQ does have a CSV loader, the CSV files I have are pretty non-standard and don't load properly into BQ without some reformatting.

Normally I would download the CSV file onto a server to 'process it' and save it either directly to BQ or to an Avro file that can be ingested easily by BQ. However, the file(s) are quite large and it's quite possible (and probable) that I wouldn't have the storage/memory to do the batch processing without writing a lot of code to optimize/stream it.

Is this a good use case for Cloud Dataflow? Are there any tutorials or ways to go about getting a file of format "X" from GCS into BQ? Any tutorial pointers or example scripts would be great.

Upvotes: 1

Views: 1481

Answers (3)

Graham Polley

Reputation: 14791

I'd personally use Dataflow (not Dataprep) and write a simple pipeline to read the file in parallel, clean/transform it, and finally write it to BigQuery. It's pretty straightforward. Here's an example of one in my GitHub repo. Although it's in Java, you could easily port it to Python. Note: it uses the "templates" feature in Dataflow, but this can be changed with one line of code.
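
If it helps, here's a minimal sketch of what such a pipeline could look like in Python with the Beam SDK. This is not the code from my repo; the project, bucket, table, and parsing logic are placeholders you'd swap for your own format:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def parse_line(line):
        # Clean/transform one raw line into a dict matching the target schema.
        # Placeholder logic: assumes pipe-delimited fields id|name|amount.
        fields = line.split('|')
        return {
            'id': int(fields[0]),
            'name': fields[1].strip(),
            'amount': float(fields[2]),
        }

    options = PipelineOptions(
        runner='DataflowRunner',
        project='my-project',                # placeholder project id
        region='us-central1',
        temp_location='gs://my-bucket/tmp',  # placeholder bucket
    )

    with beam.Pipeline(options=options) as p:
        (p
         | 'Read' >> beam.io.ReadFromText('gs://my-bucket/big-file.csv')
         | 'Parse' >> beam.Map(parse_line)
         | 'Write' >> beam.io.WriteToBigQuery(
             'my-project:my_dataset.my_table',
             schema='id:INTEGER,name:STRING,amount:FLOAT',
             create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
             write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))

Because ReadFromText splits uncompressed files across workers, the 1TB input never has to fit on a single machine.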

If Dataflow is off the table, another option could be to use a weird/unused delimiter so that each entire row is loaded into BigQuery as a single column. Then use SQL/Regex/UDFs to clean/transform/parse it. See here (suggestion from Felipe). We've done this lots of times in the past, and because you're in BigQuery it scales really well.
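
A rough sketch of that approach with the BigQuery Python client (the project, bucket, table names, and the parsing SQL below are placeholders; the bq CLI works just as well):

    from google.cloud import bigquery

    client = bigquery.Client(project='my-project')  # placeholder project

    # Load every raw line into a single STRING column by choosing a field
    # delimiter that never occurs in the data (here the thorn character).
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        field_delimiter='\u00fe',
        schema=[bigquery.SchemaField('raw', 'STRING')],
    )
    client.load_table_from_uri(
        'gs://my-bucket/big-file.csv',
        'my-project.my_dataset.raw_lines',
        job_config=job_config,
    ).result()

    # Then clean/transform/parse inside BigQuery -- placeholder query.
    client.query("""
        CREATE OR REPLACE TABLE my_dataset.parsed AS
        SELECT
          CAST(SPLIT(raw, '|')[OFFSET(0)] AS INT64) AS id,
          SPLIT(raw, '|')[OFFSET(1)] AS name
        FROM my_dataset.raw_lines
    """).result()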

Upvotes: 2

saifuddin778

Reputation: 7298

You can always load from a storage bucket directly into a BQ table:

bq --location=US load --[no]replace --source_format=CSV dataset.table gs://bucket/file.csv [schema]

Here, [schema] can be an inline schema for your CSV file (like id:integer,name:string,...) or a path to a JSON schema file (available locally).
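
For example, with placeholder names (dataset, table, bucket, and columns are yours to replace):

bq --location=US load --replace --source_format=CSV mydataset.mytable gs://mybucket/file.csv id:integer,name:string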

As per the BQ documentation, BigQuery tries to parallelize large CSV loads into tables. Of course, there is an upper bound involved: the maximum size of an uncompressed CSV file loaded from GCS into BQ is 5TB, which is well above your requirements. I think you should be good with this.

Upvotes: 0

John Hanley

Reputation: 81454

I would consider using Cloud Dataprep.

Dataprep can import data from GCS, clean/modify the data, and export it to BigQuery. One of the features that I like is that everything can be done visually/interactively, so that I can see how the data transforms.

Start with a subset of your data to see what transformations are required and to give yourself some practice before loading and processing a TB of data.

Upvotes: 2
