user1672455

Reputation: 172

Minimizing the cost of uploading a very large tar file to Google Cloud Storage

I'm currently trying to upload and then untar a very large file (1.3 TB) into Google Cloud Storage at the lowest price.

I initially thought about creating a really cheap instance just to download the file and put it in a bucket, then creating a new instance with a good amount of RAM to untar the file and put the result in a new bucket. However, since bucket pricing depends on the number of I/O requests, I'm not sure that's the best option, and it might not be the best for performance either.

What would be the best strategy to untar the file in the cheapest way?

Upvotes: 4

Views: 2441

Answers (1)

Dan

Reputation: 7737

First, some background information on pricing:

Google has pretty good documentation about how to ingest data into GCS. From that guide:

Today, when you move data to Cloud Storage, there are no ingress traffic charges. The gsutil tool and the Storage Transfer Service are both offered at no charge. See the GCP network pricing page for the most up-to-date pricing details.

The "network pricing page" just says:

[Traffic type: Ingress] Price: No charge, unless there is a resource such as a load balancer that is processing ingress traffic. Responses to requests count as egress and are charged.

There is additional information on the GCS pricing page about your idea to use a GCE VM to write to GCS:

There are no network charges for accessing data in your Cloud Storage buckets when you do so with other GCP services in the following scenarios:

  • Your bucket and GCP service are located in the same multi-regional or regional location. For example, accessing data in an asia-east1 bucket with an asia-east1 Compute Engine instance.

Later on that same page, there is also information about the per-request pricing:

Class A Operations: storage.*.insert[1]

[1] Simple, multipart, and resumable uploads with the JSON API are each considered one Class A operation.

The cost for Class A operations is per 10,000 operations, and is either $0.05 or $0.10 depending on the storage type. I believe you would only be doing 1 Class A operation (or at most, 1 Class A operation per file that you upload), so this probably wouldn't add up to much usage overall.
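To put rough numbers on that, here is a minimal sketch of the arithmetic. It assumes one Class A operation per uploaded file and that charges scale linearly with the operation count; the two prices are the ones quoted above, and the file counts are made-up examples, not figures from the question.

```python
# Prices in USD per 10,000 Class A operations, as quoted above.
# Which one applies depends on the storage class of the bucket.
PRICE_PER_10K_LOW = 0.05
PRICE_PER_10K_HIGH = 0.10


def class_a_cost(num_files: int, price_per_10k: float) -> float:
    """Estimate Class A charges, assuming one insert operation per file."""
    return num_files / 10_000 * price_per_10k


if __name__ == "__main__":
    # Hypothetical file counts, chosen only to show the order of magnitude.
    for num_files in (1, 100_000, 10_000_000):
        low = class_a_cost(num_files, PRICE_PER_10K_LOW)
        high = class_a_cost(num_files, PRICE_PER_10K_HIGH)
        print(f"{num_files:>12,} files: ${low:,.2f} to ${high:,.2f}")
```

Even if the tarball contains millions of small files, the operation charges under these assumptions stay in the tens of dollars.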


Now to answer your question:

For your use case, it sounds like you want to have the files in the tarball be individual files in GCS (as opposed to just having a big tarball stored in one file in GCS). The first step is to untar it somewhere, and the second step is to use gsutil cp to copy it to GCS.
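Those two steps could be scripted roughly as follows. This is only a sketch: the function names are my own, the paths and bucket name in the usage note are placeholders, and it assumes gsutil is installed and authenticated and that the local disk has room for the extracted files.

```python
import subprocess
import tarfile
from pathlib import Path


def extract_tarball(tar_path: str, dest_dir: str) -> int:
    """Extract tar_path into dest_dir and return the number of regular files."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    # The default read mode detects compression (gzip, bz2, xz) automatically.
    # Only do this with an archive you trust.
    with tarfile.open(tar_path) as tar:
        tar.extractall(dest_dir)
    return sum(1 for path in Path(dest_dir).rglob("*") if path.is_file())


def build_upload_command(local_dir: str, bucket_url: str) -> list:
    """Build the gsutil command line for a recursive, parallel copy."""
    # -m runs transfers in parallel; cp -r copies the directory tree.
    return ["gsutil", "-m", "cp", "-r", local_dir, bucket_url]


def upload(local_dir: str, bucket_url: str) -> None:
    """Run gsutil and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_upload_command(local_dir, bucket_url), check=True)
```

Usage would look something like `extract_tarball("big.tar", "extracted")` followed by `upload("extracted", "gs://my-bucket")`, where both the file names and the bucket are hypothetical.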

Unless you have to (i.e. there is not enough space on the machine that holds the tarball now), I wouldn't recommend copying the tarball to an intermediate VM in GCE before uploading to GCS, for two reasons:

  1. gsutil cp already handles a bunch of annoying edge cases for you: parallel uploads, resuming an upload in case there's a network failure, retries, checksum comparisons, etc.
  2. Using any GCE VMs will add cost to this whole copy operation -- costs for the disks plus costs for the VMs themselves.

If you want to try the procedure out with something lower-risk first, make a small directory with a few megabytes of data and a few files and use gsutil cp to copy it, then check how much you were charged for that. From the GCS pricing page:

Charges accrue daily, but Cloud Storage bills you only at the end of the billing period. You can view unbilled usage in your project's billing page in the Google Cloud Platform Console.

So you'd just have to wait a day to see how much you were billed.

Upvotes: 3
