terry.zhao

Reputation: 49

Cost-effective loading of data from S3 into BigQuery

I have 2 TB in ~20k files in S3, created over the course of each day, that I need to load into a date-partitioned BigQuery table. Files are rolled over every 5 minutes.

What is the most cost-effective way to get the data into BigQuery? I am looking for cost optimization in both AWS S3-to-GCP network egress and the actual data loading.

Upvotes: 2

Views: 5630

Answers (4)

bluu

Reputation: 552

Late 2020 update: you could consider using BigQuery Omni so you don't have to move your data out of S3 and still get the BigQuery capabilities you're looking for.

(disclaimer: I'm not affiliated in any way with Google, I just find it remarkable that they've started providing multi-cloud support thanks to Anthos. I hope the other cloud providers will follow suit...)

Upvotes: 2

Adrian

Reputation: 2113

BigQuery data ingestion

You have a few options for getting your S3 data ingested into BigQuery, depending on how quickly you need the data available there. Any requirements for data transformation (enrichment, deduplication, aggregation) should also be factored into the overall cost.

The fastest way to get data into BigQuery is the streaming API (seconds of delay), which comes with a $0.010 per 200 MB charge. Streaming API Pricing
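
For illustration, a minimal sketch of a streaming insert with the google-cloud-bigquery Python client; the project, dataset, table, and field names below are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Stream rows into the table; billed at $0.010 per 200 MB.
errors = client.insert_rows_json(
    "my-project.my_dataset.events",  # hypothetical table id
    [{"event_date": "2020-11-01", "payload": "..."}],
)
if errors:
    print("Row-level insert errors:", errors)
```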

The BigQuery Transfer Service is another choice; it is the easiest and is free of charge. It lets you schedule a data transfer to run no more often than once a day (currently). In your case, where data is produced continuously, it would be the slowest way to get data into BigQuery. Transfer Service Pricing
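
As a sketch, a scheduled S3 transfer could be created with the google-cloud-bigquery-datatransfer client roughly like this. All names, paths, and credentials are placeholders, and the exact parameter set should be checked against the S3 transfer documentation:

```python
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

# Daily transfer from S3 into a BigQuery dataset.
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="my_dataset",  # placeholder
    display_name="s3-daily-load",
    data_source_id="amazon_s3",
    schedule="every 24 hours",
    params={
        "data_path": "s3://my-bucket/events/*.json.gz",  # placeholder
        "destination_table_name_template": "events",     # placeholder
        "file_format": "JSON",
        "access_key_id": "AWS_ACCESS_KEY_ID",            # placeholder
        "secret_access_key": "AWS_SECRET_ACCESS_KEY",    # placeholder
    },
)
client.create_transfer_config(
    parent="projects/my-project",  # placeholder project
    transfer_config=transfer_config,
)
```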

If you need complex transformation, you may also consider Cloud Dataflow, which is not free of charge. Cloud Dataflow Pricing

Lastly, you may also consider a serverless solution that is fully event-driven, giving you data ingestion in close to real time. With this, you would pay for Lambda and Cloud Function executions, which should come to a few dollars per day, plus egress cost.
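
For instance, a minimal sketch of the GCP side, assuming a background Cloud Function triggered when a mirrored file lands in GCS (the function name, table id, and partition field are made up). Note that with ~20k files per day, one load job per file would blow through the 1K loads-per-table daily quota, which is why the batching discussed below matters:

```python
from google.cloud import bigquery

bq = bigquery.Client()

def load_on_arrival(event, context):
    """Background Cloud Function fired by a GCS object-finalize event."""
    uri = f"gs://{event['bucket']}/{event['name']}"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        time_partitioning=bigquery.TimePartitioning(field="event_date"),
    )
    # One load job per arriving file; see the batching note below.
    bq.load_table_from_uri(
        uri, "my-project.my_dataset.events", job_config=job_config  # placeholder
    ).result()
```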

For data mirroring between AWS S3 and Google Cloud Storage, you could use the serverless Cloud Storage Mirror, which comes with payload-size optimization via either data compression or dynamic AVRO transcoding.

For getting data loaded into BigQuery, you can use the serverless BqTail, which allows you to run loads in batches. To stay under BigQuery's quota of 1K loads per table per day, you could comfortably use a 90-second batch window (86,400 seconds per day / 90 s ≈ 960 loads), which would get your data into BigQuery within a few minutes' delay in the worst case. Optionally, you can also run data deduplication, data enrichment, and aggregation; a rough sketch of the batching idea follows.
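
This is not BqTail itself, just a toy illustration of the 90-second window arithmetic: collect the URIs of newly arrived objects and issue one multi-URI load job per window (table id and source of the URIs are placeholders):

```python
import time

from google.cloud import bigquery

bq = bigquery.Client()
pending = []  # gs:// URIs collected from storage notifications (placeholder feed)

def flush(uris):
    """Load a whole batch of files in a single job."""
    if not uris:
        return
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    )
    bq.load_table_from_uri(
        uris, "my-project.my_dataset.events", job_config=job_config  # placeholder
    ).result()
    uris.clear()

while True:
    time.sleep(90)  # one load per window -> ~960 loads/day, under the 1K quota
    flush(pending)
```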

Egress cost consideration

In your scenario, where the transfer size is relatively small at 2 TB per day, I would accept the egress cost; however, if you expect to grow to 40 TB+ per day, you may consider a direct connect to GCP. With a simple proxy, that should come with a substantial cost reduction.

Upvotes: 4

guillaume blaquiere

Reputation: 75950

There are several ways to optimize the transfer and the load.

First of all, the network egress from AWS can't be avoided. If you can, gzip your files before storing them in S3. You will reduce the egress bandwidth, and BigQuery can load compressed files directly.
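
As a minimal sketch (bucket and file names are made up), compressing with the standard library before uploading via boto3:

```python
import gzip
import shutil

import boto3

# Compress locally first; text/JSON logs typically shrink severalfold,
# which cuts the S3 -> GCP egress bill proportionally.
with open("events.json", "rb") as src, gzip.open("events.json.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)

boto3.client("s3").upload_file("events.json.gz", "my-bucket", "events/events.json.gz")
```

BigQuery can load gzipped CSV/NDJSON as-is, so no decompression step is needed on the GCP side.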

If the workload that writes to S3 can't gzip the files, you have to compare the cost of the processing time spent gzipping against the egress cost of the uncompressed files.

For GCS, we often speak about cost in GB per month. That's misleading: when you look at the detailed billing, the cost is actually prorated per second (GB-seconds). So the less time you leave your files in storage, the less you pay. If you load your files into BigQuery quickly after the transfer and then delete them, you will pay almost nothing for GCS storage.
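
One way to enforce that (a sketch; the bucket name is hypothetical) is a lifecycle rule that deletes staged objects shortly after they land, using the google-cloud-storage client:

```python
from google.cloud import storage

bucket = storage.Client().bucket("my-staging-bucket")  # hypothetical bucket

# Delete staged files one day after creation; by then they should
# already be loaded into BigQuery, so storage cost stays near zero.
bucket.add_lifecycle_delete_rule(age=1)
bucket.patch()
```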

Upvotes: 0

Prashant

Reputation: 1180

Google Cloud supports (in beta) a BigQuery Transfer Service for S3; details are mentioned here. The other mechanism is S3 -> GCS -> BigQuery, which I believe will incur GCS costs too.

As per Google Cloud's pricing docs, the transfer itself is "no charge" from the Google Cloud point of view, with limits applicable.

Data transfer from S3 to Google Cloud over the internet (I am assuming it's not over VPN) is priced as mentioned here. Your data is around 2 TB per day, so at the table's rate of $0.09 per GB that works out to roughly 2,048 GB x $0.09 ≈ $184 per day in egress.

Upvotes: 1
