Reputation: 47
I have a schema stored in BigQuery and I need to transfer that data to an Amazon S3 bucket. What is the process? The BigQuery data belongs to another organization, which is willing to grant the proper access for the transfer. I found ways to import from an S3 bucket into GCP, but no way to export from BigQuery to S3 directly, and I'm really stuck on this part. I also need to schedule this process, because the data in BigQuery changes and I want it delivered to my S3 bucket every day. Please mention references for that. Note that each day's data will be in the TBs.
Upvotes: 3
Views: 6292
Reputation: 3341
You can use BigQuery connections to export to S3 directly with an EXPORT DATA query. Here is the documentation for this feature: https://cloud.google.com/bigquery/docs/reference/standard-sql/export-statements
Sample code:
EXPORT DATA
  WITH CONNECTION `myproject.us.myconnection`
  OPTIONS(
    uri = 's3://bucket/folder/*',
    format = 'JSON',
    overwrite = true)
AS
SELECT field1, field2 FROM mydataset.table1 ORDER BY field1 LIMIT 10
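Since the question also asks about running this every day: an EXPORT DATA statement can itself be put into a BigQuery scheduled query. A rough sketch with the bq CLI, reusing the connection and names from the sample above (the display name and schedule are placeholders of mine):
bq query \
  --use_legacy_sql=false \
  --display_name='daily_export_to_s3' \
  --schedule='every 24 hours' \
  'EXPORT DATA WITH CONNECTION `myproject.us.myconnection`
   OPTIONS(uri="s3://bucket/folder/*", format="JSON", overwrite=true)
   AS SELECT field1, field2 FROM mydataset.table1'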
Upvotes: 0
Reputation: 316
I believe this is possible directly using the EXPORT DATA query. Ref: Transferring BigQuery data to Amazon S3 bucket
EXPORT DATA WITH CONNECTION `CONNECTION_REGION.CONNECTION_NAME`
OPTIONS(uri="s3://BUCKET_NAME/PATH", format="FORMAT", ...)
AS QUERY
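For reference, a filled-in version of that template might look like this (the connection, bucket and table names are placeholders I made up; as I understand it the connection has to live in a BigQuery Omni AWS region such as aws-us-east-1, and PARQUET is one of the supported formats, so check the export-statements documentation for your case):
EXPORT DATA WITH CONNECTION `aws-us-east-1.s3_export_conn`
OPTIONS(uri="s3://my-bucket/exports/*", format="PARQUET")
AS SELECT * FROM mydataset.mytable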
Upvotes: 0
Reputation: 601
AFAIK there is no direct BigQuery-to-S3 pipeline, but the workaround is BigQuery to Google Cloud Storage, then GCS to S3.
The first step is to set up the gcloud SDK on the machine where the job will be scheduled.
Configure gcloud and pay special attention to the boto configuration (a sketch is just below). You can skip this step if the process is going to run on an EC2 instance with a role attached that allows it to write to S3 (which is also a better security practice than using keys).
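For the boto part: gsutil picks up the AWS credentials it uses for s3:// URLs from the ~/.boto file, so a minimal config sketch looks like this (the key values are placeholders):
[Credentials]
aws_access_key_id = <YOUR_AWS_ACCESS_KEY_ID>
aws_secret_access_key = <YOUR_AWS_SECRET_ACCESS_KEY>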
Copy the data from BigQuery to GCS:
bq --location=US extract --destination_format AVRO --compression SNAPPY [source_table] [target location on gcs]
Copy the files from GCS to S3:
gsutil rsync -r [target location on gcs] [target location on s3]
Note: this process works fine for GA360 data because it is already partitioned by day. If that is not the case and you can't separate the tables, you can schedule a query on GCP that creates the table with incremental logic and use that as the [source_table].
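To run this every day, the two commands can be wrapped in a small script and scheduled with cron on that machine. A rough sketch, assuming a day-partitioned source table and placeholder bucket names:
#!/usr/bin/env bash
# daily_bq_to_s3.sh - export yesterday's partition and sync it to S3
set -euo pipefail

DS=$(date -d "yesterday" +%Y%m%d)

# extract one day's partition from BigQuery to GCS (wildcard URI because a single day is TBs)
bq --location=US extract \
  --destination_format AVRO \
  --compression SNAPPY \
  "mydataset.mytable\$${DS}" \
  "gs://my-staging-bucket/export/${DS}/part-*.avro"

# sync the staging prefix to S3 (credentials come from ~/.boto or the attached EC2 role)
gsutil -m rsync -r \
  "gs://my-staging-bucket/export/${DS}/" \
  "s3://my-s3-bucket/export/${DS}/"

# crontab entry to run it daily at 06:00
# 0 6 * * * /opt/scripts/daily_bq_to_s3.sh >> /var/log/bq_to_s3.log 2>&1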
Update: AWS has released a BigQuery Glue connector that can do this.
Upvotes: 5