Reputation: 1277
I am very new to using Google cloud and cloud servers, and I am stuck on a very basic question.
I would like to bulk download ~60,000 csv.gz files from an internet server (with permission). I compiled a set of curl commands that pipe into gsutil to upload into my bucket, saved in an .sh file that looks like the following.
curl http://internet.address/csvs/file1.csv.gz | gsutil cp - gs://my_bucket/file1.csv.gz
curl http://internet.address/csvs/file2.csv.gz | gsutil cp - gs://my_bucket/file2.csv.gz
...
curl http://internet.address/csvs/file60000.csv.gz | gsutil cp - gs://my_bucket/file60000.csv.gz
However, this will take ~10 days if I run it from my machine, so I'd like to run it from the cloud directly. I don't know the best way to do this: the process is too long to run in Cloud Shell directly, and I'm not sure which other Google Cloud service would be best for running an .sh script that downloads into a Cloud Storage bucket, or whether this type of .sh script is even the most efficient way to bulk download files from the internet on Google Cloud.
I've seen some advice to use the Cloud SDK, which I've installed on my local machine, but I don't even know where to start with that.
Any help with this is greatly appreciated!
Upvotes: 4
Views: 2646
Reputation: 4620
gcloud and Cloud Storage don't offer a way to grab objects from the internet and copy them directly into a bucket without an intermediary (a computer, server, or cloud application).
As for which Cloud service can help you run a bash script, you can use a GCE always-free f1-micro VM instance (one free instance per billing account).
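For example, a minimal sketch of creating such a VM and connecting to it with the Cloud SDK (the instance name "downloader" and the zone are just placeholders; pick a US region that qualifies for the free tier):

# Create an always-free f1-micro VM; the storage-rw scope lets gsutil on the VM write to your buckets.
gcloud compute instances create downloader --machine-type=f1-micro --zone=us-central1-a --scopes=storage-rw
# SSH into it and run your .sh script from there.
gcloud compute ssh downloader --zone=us-central1-a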
To speed up the uploads to your bucket, you can use GNU parallel to run multiple curl commands at the same time and reduce the time it takes to complete the task.
To install parallel on Ubuntu/Debian, run this command:
sudo apt-get install parallel
For example, you can create a file called downloads containing the commands that you want to parallelize (you must write all of the curl commands in the file; a loop to generate it is sketched after the example).
downloads file:
curl http://internet.address/csvs/file1.csv.gz | gsutil cp - gs://my_bucket/file1.csv.gz
curl http://internet.address/csvs/file2.csv.gz | gsutil cp - gs://my_bucket/file2.csv.gz
curl http://internet.address/csvs/file3.csv.gz | gsutil cp - gs://my_bucket/file3.csv.gz
curl http://internet.address/csvs/file4.csv.gz | gsutil cp - gs://my_bucket/file4.csv.gz
curl http://internet.address/csvs/file5.csv.gz | gsutil cp - gs://my_bucket/file5.csv.gz
curl http://internet.address/csvs/file6.csv.gz | gsutil cp - gs://my_bucket/file6.csv.gz
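Since writing 60,000 lines by hand isn't practical, a short loop can generate the file; this is a sketch that assumes the files really are numbered file1 through file60000 as in the question:

# Generate one "curl | gsutil" line per file and write them all to the downloads file.
for i in $(seq 1 60000); do
  echo "curl http://internet.address/csvs/file${i}.csv.gz | gsutil cp - gs://my_bucket/file${i}.csv.gz"
done > downloads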
After that, you simply need to run the following command:
parallel --jobs 2 < downloads
This will run up to 2 curl commands in parallel until all of the commands in the file have been executed.
Another improvement you can apply to your routine is to use gsutil mv instead of gsutil cp: the mv command deletes the local file after a successful upload, which can help you save space on your hard drive.
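For example, in a variant of the routine that downloads to local disk first instead of piping (the file name is purely illustrative):

# Download to the local disk, then move it into the bucket; mv removes the local copy after the upload succeeds.
curl -O http://internet.address/csvs/file1.csv.gz
gsutil mv file1.csv.gz gs://my_bucket/file1.csv.gz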
Upvotes: 3
Reputation: 2593
If you have the MD5 hashes of each CSV file, you could use the Storage Transfer Service, which supports copying a list of files (that must be publicly accessible via HTTP[S] URLs) to your desired GCS bucket. See the Transfer Service docs on URL lists.
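For reference, the URL list is a tab-separated file whose first line names the format, followed by one line per object with the URL, size in bytes, and Base64-encoded MD5 hash (the sizes and hashes below are placeholders):

TsvHttpData-1.0
http://internet.address/csvs/file1.csv.gz	2048576	wHENa08V36iPYAsOa2JAdw==
http://internet.address/csvs/file2.csv.gz	2115008	yZRlqg+ZxCkDDGnNzdwxDg==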
Upvotes: 0