Gwendal Yviquel

Reputation: 392

Export BigQuery table to GCS bucket into multiple folders/files corresponding to clusters

Due to loading time and query cost, I need to export a BigQuery table to multiple Google Cloud Storage folders within a bucket.

I currently use ExtractJobConfig from the BigQuery Python client with the wildcard operator to create multiple files. But I need to create a folder for every nomenclature value (a column in the BigQuery table), and then create the multiple files inside each folder.
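For reference, the current approach looks roughly like this (project, dataset, table and bucket names are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical names -- replace with your own.
    table_ref = 'my-project.my_dataset.my_table'

    job_config = bigquery.job.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON)

    # The '*' wildcard shards the export into multiple files,
    # but they all land under a single prefix.
    extract_job = client.extract_table(
        table_ref,
        'gs://my-bucket/export/part-*.json',
        job_config=job_config)
    extract_job.result()  # wait for the job to finish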

The table is pretty huge and won't fit in RAM (it could, but that's not the idea): it is 1+ TB. I cannot naively loop over it in Python.

I read quite a lot of documentation and parsed the parameters, but I can't find a clean solution. Did I miss something, or is there no Google solution?

My plan B is to use Apache Beam and Dataflow, but I have no skills there yet, and I would like to avoid this solution as much as possible for simplicity and maintenance.

Upvotes: 0

Views: 906

Answers (1)

guillaume blaquiere

Reputation: 75810

You have 2 solutions:

  • Create 1 export query per nomenclature value. If you have 100 nomenclature values, query the table 100 times and export each result to its target directory (a minimal sketch follows this list). The issue is the cost: you will pay for 100 full scans of the table.
  • You can use Apache Beam to extract the data and to sort it. Then, with a dynamic destination, you will be able to create all the GCS paths that you want (see the Beam sketch below). The issue is that it requires Apache Beam skills to achieve it.
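A minimal sketch of the first option, using the Python client and BigQuery's EXPORT DATA statement (table, column and bucket names are placeholders, and the nomenclature values are assumed to be plain strings safe to embed in a URI and in SQL):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Hypothetical names -- replace with your own.
    TABLE = 'my-project.my_dataset.my_table'
    BUCKET = 'my-bucket'

    # One cheap query to list the distinct values (single-column scan).
    values = [row.nomenclature for row in
              client.query(f'SELECT DISTINCT nomenclature FROM `{TABLE}`').result()]

    # One EXPORT DATA statement per value. Each statement re-scans the
    # whole table, which is where the cost multiplies.
    for value in values:
        client.query(f"""
            EXPORT DATA OPTIONS(
              uri='gs://{BUCKET}/export/{value}/part-*.json',
              format='JSON',
              overwrite=true) AS
            SELECT * FROM `{TABLE}`
            WHERE nomenclature = '{value}'
        """).result()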
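And a rough sketch of the second option with Beam's fileio dynamic destinations (again, all names are placeholders; ReadFromBigQuery also needs a GCS temp_location pipeline option, and the slash in the destination string is what is assumed to turn each value into its own GCS "folder" prefix):

    import json

    import apache_beam as beam
    from apache_beam.io import fileio
    from apache_beam.options.pipeline_options import PipelineOptions


    class JsonSink(fileio.FileSink):
        """Writes each BigQuery row (a dict) as one JSON line."""
        def open(self, fh):
            self._fh = fh

        def write(self, record):
            self._fh.write(json.dumps(record).encode('utf-8'))
            self._fh.write(b'\n')

        def flush(self):
            pass  # rows are written straight to the file handle


    # Hypothetical names -- replace with your own.
    TABLE = 'my-project:my_dataset.my_table'
    OUTPUT = 'gs://my-bucket/export'

    with beam.Pipeline(options=PipelineOptions()) as pipeline:
        (pipeline
         | 'Read' >> beam.io.ReadFromBigQuery(table=TABLE)  # yields dicts
         | 'Write' >> fileio.WriteToFiles(
             path=OUTPUT,
             # '<value>/part' routes each row under its own prefix.
             destination=lambda row: '%s/part' % row['nomenclature'],
             sink=lambda dest: JsonSink(),
             file_naming=fileio.destination_prefix_naming(suffix='.json')))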

You have an extra solution, similar to the 2nd one, but with Spark, and especially Spark Serverless. If you have more skills in Spark than in Apache Beam, it could be more efficient; a sketch is below.
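A minimal sketch with PySpark and the spark-bigquery connector (bundled with Dataproc Serverless; table, column and bucket names are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('bq-export-by-nomenclature').getOrCreate()

    # Hypothetical names -- replace with your own.
    df = (spark.read.format('bigquery')
          .option('table', 'my-project.my_dataset.my_table')
          .load())

    # partitionBy creates one folder per distinct value:
    # gs://my-bucket/export/nomenclature=<value>/part-*.json
    (df.write
     .partitionBy('nomenclature')
     .mode('overwrite')
     .json('gs://my-bucket/export'))

Note that Spark's partitionBy names the folders nomenclature=<value> rather than just <value>, which most downstream tools understand as Hive-style partitioning.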

Upvotes: 3
