Trishit Ghosh

Reputation: 255

Extract BigQuery partitioned table

Is there a way to extract a complete BigQuery partitioned table with one command, so that the data of each partition is extracted into a separate folder of the format part_col=date_yyyy-mm-dd?

Since BigQuery partitioned tables can read files from hive-type partitioned directories, is there a way to extract the data in a similar way? I can extract each partition separately, but that is very cumbersome when I am extracting a lot of partitions.

Upvotes: 5

Views: 5672

Answers (3)

DanJGer

Reputation: 514

I created a script to iterate over the date partitions, if it helps. It would be easy to convert to using parameters; a sketch of that conversion follows the script.

#!/bin/bash
input_start=2018-1-1
input_end=2018-3-1
project=myproject
dataset=mydataset
table=table_of_stuff

startdate=$(date -I -d "$input_start") || exit -1
enddate=$(date -I -d "$input_end")     || exit -1

d="$startdate"
while [[ "$d" < "$enddate" ]]; do
  year_val=$(date --date="$d" +%Y)
  mon_val=$(date --date="$d" +%m)
  day_val=$(date --date="$d" +%d)
  # echo prints the bq extract command for review; remove the echo to actually run it
  echo bq extract --location=US --destination_format PARQUET --compression SNAPPY \
    $project:$dataset.$table\$$year_val$mon_val$day_val \
    gs://your_bucket/table_archive/$year_val/$dataset/$table/date_col=$year_val-$mon_val-$day_val/*.parquet
  d=$(date -I -d "$d + 1 day")
done
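As a minimal sketch of the parameterized version (assuming GNU date and the same Parquet/Snappy output; the GCS path layout and the date_col folder name are illustrative, adjust them to your own bucket):

#!/bin/bash
# Same loop, driven by positional parameters instead of hard-coded values.
# Usage: ./extract_partitions.sh 2018-01-01 2018-03-01 myproject mydataset table_of_stuff your_bucket
startdate=$(date -I -d "$1") || exit 1
enddate=$(date -I -d "$2")   || exit 1
project=$3
dataset=$4
table=$5
bucket=$6

d="$startdate"
while [[ "$d" < "$enddate" ]]; do
  part=$(date --date="$d" +%Y%m%d)   # partition decorator, e.g. 20180101
  bq extract --location=US --destination_format PARQUET --compression SNAPPY \
    "$project:$dataset.$table\$$part" \
    "gs://$bucket/$dataset/$table/date_col=$d/*.parquet"
  d=$(date -I -d "$d + 1 day")
done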

Upvotes: 0

amit shrivastava

Reputation: 26

Set the project to test_dataset using gcloud init before running the command below.

bq extract --destination_format=CSV 'test_partitiontime$20210716' gs://testbucket/20210716/test*.csv

This will create a folder named 20210716 inside testbucket and write the files there.
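If you need every partition rather than a single date, one possible approach (a sketch, assuming standard SQL and that test_partitiontime is a time-partitioned table in test_dataset) is to list the existing partition IDs from INFORMATION_SCHEMA.PARTITIONS and run one extract per partition:

#!/bin/bash
# List the partition ids (YYYYMMDD) of the table, then extract each one into its own folder.
# Assumes the default project has already been set with gcloud init.
for p in $(bq query --nouse_legacy_sql --format=csv \
    "SELECT partition_id FROM test_dataset.INFORMATION_SCHEMA.PARTITIONS
     WHERE table_name = 'test_partitiontime'
       AND partition_id NOT IN ('__NULL__', '__UNPARTITIONED__')" \
    | tail -n +2); do
  bq extract --destination_format=CSV \
    "test_dataset.test_partitiontime\$$p" \
    "gs://testbucket/$p/test*.csv"
done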

Upvotes: 0

Héctor Neri

Reputation: 1452

You could do this programmatically. For instance, you can export the partitioned data by using the partition decorator, such as table$20190801, and then in the bq extract command you can use URI patterns (see the example of the 'workers' pattern) for the GCS objects.

Since all objects will be within the same bucket, the folders are just a hierarchical illusion, so you can specify URI patterns on the folders as well, but not on the bucket.

So you would write a script where you loop over the DATE value, with something like:

bq extract \
--destination_format [CSV, NEWLINE_DELIMITED_JSON, AVRO] \
--compression [GZIP, AVRO supports DEFLATE and SNAPPY] \
--field_delimiter [DELIMITER] \
--print_header [true, false] \
[PROJECT_ID]:[DATASET].[TABLE]$[DATE] \
gs://[BUCKET]/part_col=[DATE]/[FILENAME]-*.[csv, json, avro]
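For illustration, a minimal runnable version of that loop (a sketch, assuming GNU date, a daily-partitioned table, and CSV/GZIP output; PROJECT, DATASET, TABLE and BUCKET are placeholders to fill in):

#!/bin/bash
# Export each daily partition into a hive-style folder gs://BUCKET/part_col=YYYY-MM-DD/,
# one bq extract per partition decorator.
project=PROJECT
dataset=DATASET
table=TABLE
bucket=BUCKET

d=2019-08-01
end=2019-09-01
while [[ "$d" < "$end" ]]; do
  decorator=$(date -d "$d" +%Y%m%d)   # partition decorator, e.g. 20190801
  bq extract \
    --destination_format CSV \
    --compression GZIP \
    "$project:$dataset.$table\$$decorator" \
    "gs://$bucket/part_col=$d/$table-*.csv"
  d=$(date -I -d "$d + 1 day")
done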

You can't do it automatically with just one bq command. For this, it would be better to raise a feature request, as suggested by Felipe.

Upvotes: 8
