Reputation: 255
Is there a way to extract a complete BigQuery partitioned table with one command, so that the data for each partition is extracted into a separate folder of the format part_col=date_yyyy-mm-dd?
Since BigQuery partitioned tables can read files from hive-style partitioned directories, is there a way to extract the data in a similar layout? I can extract each partition separately, but that is very cumbersome when I am extracting a lot of partitions.
Upvotes: 5
Views: 5672
Reputation: 514
I created a script that iterates over a date range, in case it helps. It would be easy to convert it to use parameters.
#!/bin/bash
# Prints one bq extract command per daily partition in the date range.
input_start=2018-1-1
input_end=2018-3-1
project=myproject
dataset=mydataset
table=table_of_stuff
startdate=$(date -I -d "$input_start") || exit 1
enddate=$(date -I -d "$input_end") || exit 1
d="$startdate"
while [[ "$d" < "$enddate" ]]; do
    year_val=$(date --date="$d" +%Y)
    mon_val=$(date --date="$d" +%m)
    day_val=$(date --date="$d" +%d)
    # The $ after the table name is the partition decorator (YYYYMMDD)
    echo bq extract --location=US --destination_format PARQUET --compression SNAPPY \
        "$project:$dataset.$table\$$year_val$mon_val$day_val" \
        "gs://your_bucket/table_archive/$year_val/$dataset/$table/date_col=$year_val-$mon_val-$day_val/*.parquet"
    d=$(date -I -d "$d + 1 day")
done
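As written, the script only echoes each bq extract command so you can review them first; to actually run them, either remove the echo or pipe the output through a shell (assuming you saved the script as extract_partitions.sh, a name chosen here for illustration):
./extract_partitions.sh | bash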
Upvotes: 0
Reputation: 26
Set the project as test_dataset using gcloud init before running the command below.
bq extract --destination_format=CSV 'test_partitiontime$20210716' gs://testbucket/20210716/test*.csv
This will create a folder with the name 20210716 inside testbucket and write the file there.
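If you want the hive-style layout from the question, the same command can write under a part_col= prefix instead, since GCS treats the folder part of the path as an ordinary object-name prefix. A minimal sketch reusing the bucket and table names above:
bq extract --destination_format=CSV 'test_partitiontime$20210716' gs://testbucket/part_col=2021-07-16/test*.csv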
Upvotes: 0
Reputation: 1452
You could do this programmatically. For instance, you can export partitioned data by using the partition decorator, such as table$20190801, and then in the bq extract command use URI patterns (see the example with the workers pattern) for the GCS objects.
Since all objects will be within the same bucket, the folders are just a hierarchical illusion, so you can specify URI patterns on the folders as well, but not on the bucket.
So you could write a script that loops over the DATE value, using the template below; a concrete sketch follows it:
bq extract
--destination_format [CSV, NEWLINE_DELIMITED_JSON, AVRO]
--compression [GZIP, AVRO supports DEFLATE and SNAPPY]
--field_delimiter [DELIMITER]
--print_header [true, false]
[PROJECT_ID]:[DATASET].[TABLE]$[DATE]
gs://[BUCKET]/part_col=[DATE]/[FILENAME]-*.[csv, json, avro]
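A minimal sketch of such a loop (GNU date; the project, dataset, table, and bucket names are placeholders):
#!/bin/bash
# Sketch: extract each daily partition into its own hive-style folder.
project=my-project      # placeholder
dataset=my_dataset      # placeholder
table=my_table          # placeholder
bucket=my_bucket        # placeholder
d=2019-08-01
end=2019-09-01
while [[ "$d" < "$end" ]]; do
    # The partition decorator expects YYYYMMDD, e.g. table$20190801
    suffix=$(date -d "$d" +%Y%m%d)
    bq extract \
        --destination_format NEWLINE_DELIMITED_JSON \
        "$project:$dataset.$table\$$suffix" \
        "gs://$bucket/part_col=$d/data-*.json"
    d=$(date -I -d "$d + 1 day")
done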
You can't do it automatically with just a bq command. For this it would be better to raise a feature request as suggested by Felipe.
Upvotes: 8