morepenguins

Reputation: 1277

Best way to run bash script on Google Cloud to bulk download to Bucket

I am very new to using Google Cloud and cloud servers, and I am stuck on a very basic question.

I would like to bulk download ~60,000 csv.gz files from an internet server (with permission). I compiled the curl commands, each piped into a gsutil upload to my bucket, into an .sh file that looks like the following.

curl http://internet.address/csvs/file1.csv.gz | gsutil cp - gs://my_bucket/file1.csv.gz
curl http://internet.address/csvs/file2.csv.gz | gsutil cp - gs://my_bucket/file2.csv.gz
...
curl http://internet.address/csvs/file60000.csv.gz | gsutil cp - gs://my_bucket/file60000.csv.gz

However, this would take ~10 days if I ran it from my machine, so I'd like to run it from the cloud directly, and I don't know the best way to do that. The process is too long to run in Cloud Shell directly, and I'm not sure which Google Cloud service is best suited to running an .sh script that downloads into a Cloud Storage bucket, or whether this type of .sh script is even the most efficient way to bulk download files from the internet using Google Cloud.

I've seen some advice to use the SDK, which I've installed on my local machine, but I don't even know where to start with that.

Any help with this is greatly appreciated!

Upvotes: 4

Views: 2646

Answers (2)

Jan Hernandez

Reputation: 4620

gcloud and Cloud Storage don't offer a way to grab objects from the internet and copy them directly into a bucket without an intermediary (a computer, server, or cloud application).

Regarding which Cloud service you can use to run a bash script: you can use a GCE always-free f1-micro VM instance (1 free instance per billing account).
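A minimal sketch of creating and connecting to such a VM with the gcloud CLI (the instance name, zone, and image below are examples; free-tier f1-micro usage is limited to certain US regions):

# Create an f1-micro instance (name, zone, and image are examples)
gcloud compute instances create csv-downloader \
    --machine-type=f1-micro \
    --zone=us-central1-a \
    --image-family=debian-12 \
    --image-project=debian-cloud

# SSH into the instance to run the download script there
gcloud compute ssh csv-downloader --zone=us-central1-a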

To speed up the uploads to the bucket, you can use GNU parallel to run multiple curl commands at the same time and reduce the time needed to complete the task.

To install parallel on Ubuntu/Debian, run this command:

sudo apt-get install parallel

For example, you can create a file called downloads containing the commands you want to parallelize (one curl | gsutil pipeline per line):

downloads file

curl http://internet.address/csvs/file1.csv.gz | gsutil cp - gs://my_bucket/file1.csv.gz
curl http://internet.address/csvs/file2.csv.gz | gsutil cp - gs://my_bucket/file2.csv.gz
curl http://internet.address/csvs/file3.csv.gz | gsutil cp - gs://my_bucket/file3.csv.gz
curl http://internet.address/csvs/file4.csv.gz | gsutil cp - gs://my_bucket/file4.csv.gz
curl http://internet.address/csvs/file5.csv.gz | gsutil cp - gs://my_bucket/file5.csv.gz
curl http://internet.address/csvs/file6.csv.gz | gsutil cp - gs://my_bucket/file6.csv.gz
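If your file names follow a simple numeric pattern like the one in the question, you could generate the downloads file with a short loop instead of writing 60,000 lines by hand (a sketch, assuming the file1.csv.gz … file60000.csv.gz naming):

# Sketch: generate the downloads file, assuming files are named file1.csv.gz ... file60000.csv.gz
for i in $(seq 1 60000); do
  echo "curl -s http://internet.address/csvs/file${i}.csv.gz | gsutil cp - gs://my_bucket/file${i}.csv.gz"
done > downloads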

After that, you simply need to run the following command

parallel --jobs 2 < downloads

This command will run up to 2 parallel curl commands until all the commands in the file have been executed.

Another improvement you can apply to your routine is to use gsutil mv instead of gsutil cp: mv deletes the local file after a successful upload, which helps you save space on your hard drive (this applies when you download files to disk first rather than piping from curl).
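A minimal sketch of that variant, assuming you download each file to the VM's disk before uploading (the pipe-from-stdin form above leaves no local file for mv to delete):

# Download to local disk, then move the file to the bucket (uploads and deletes the local copy)
curl -sO http://internet.address/csvs/file1.csv.gz
gsutil mv file1.csv.gz gs://my_bucket/file1.csv.gz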

Upvotes: 3

mhouglum

Reputation: 2593

If you have the MD5 hashes of each CSV file, you could use the Storage Transfer Service, which supports copying a list of files (that must be publicly accessible via HTTP[S] URLs) to your desired GCS bucket. See the Transfer Service docs on URL lists.
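A sketch of what such a URL list could look like, assuming the TsvHttpData-1.0 format described in those docs, where each tab-separated row lists the URL, the file's size in bytes, and its base64-encoded MD5 hash (the size and hash values below are placeholders):

TsvHttpData-1.0
http://internet.address/csvs/file1.csv.gz	<SIZE_IN_BYTES>	<BASE64_MD5>
http://internet.address/csvs/file2.csv.gz	<SIZE_IN_BYTES>	<BASE64_MD5>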

Upvotes: 0
