Reputation: 581
I'm trying to move my files from a Google Cloud bucket to a VM instance. Let me first make sure that this is the right strategy for what I'm trying to accomplish. I have 400 GB of data and it takes an incredible amount of time just to open the files. I need to do some parallel processing, and my laptop, I think, only allows up to four parallel processes at a time.
First, I don't think it's possible, but just in case it is: I would like to read the files in my cloud bucket without transferring them to a VM instance. I suspect this is only possible if a VM instance relates to a cloud bucket the way a laptop relates to an external hard drive. If it is not possible, then I have to download the files from the cloud bucket.
I tried using the following code:
from google.cloud import storage

def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)
    print('Blob {} downloaded to {}.'.format(
        source_blob_name,
        destination_file_name))

download_blob('as_lists1', '42.pkl', "kylefoley@instance-1:/home/kylefoley/42.pkl")
No error message was thrown, but when I tried to list the contents of the instance-1 hard drive, 42.pkl did not come up:
kylefoley@instance-1:~$ ls
distraction.txt env hey you2.txt you.txt
kylefoley@instance-1:~$ pwd
/home/kylefoley
Also, does anyone know whose bandwidth is used when I do that transfer? If it's the bandwidth I pay for, then there is no point in splitting the transfer across several computers. If it is someone else's bandwidth, then it would be a good idea to split the data into parts and transfer each part from a different computer at the same time.
Upvotes: 0
Views: 1088
Reputation: 4961
The easiest way to copy the contents of your bucket to a GCP VM instance is to run the following command on the VM:

gsutil cp -r gs://Your_Bucket/* ./
Make sure that your service account has the proper permissions to access the files in your bucket, or make your bucket public. You can grant the Storage Object Admin, Storage Object Creator, or Storage Object Viewer role, depending on the needs of your project.
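If you prefer to manage permissions from code rather than the console, the same client library can update the bucket's IAM policy. A minimal sketch, assuming the credentials you run it with are allowed to change IAM policy; the bucket name and service-account email are placeholders:

from google.cloud import storage

def grant_object_viewer(bucket_name, service_account_email):
    """Give a service account read access to the objects in a bucket."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    # Version 3 policies expose bindings as a list of role/member dicts.
    policy = bucket.get_iam_policy(requested_policy_version=3)
    policy.bindings.append({
        'role': 'roles/storage.objectViewer',
        'members': {'serviceAccount:' + service_account_email},
    })
    bucket.set_iam_policy(policy)

grant_object_viewer('your_bucket', 'your-sa@your-project.iam.gserviceaccount.com')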
You can also use Python to download your files. Here is an example that works for me:
from google.cloud import storage

if __name__ == '__main__':
    bucket_name = 'your_bucket'
    source_blob_name = 'your_object'
    destination_file_name = 'local_file'

    # DOWNLOAD
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)
    print('Blob {} downloaded to {}.'.format(source_blob_name, destination_file_name))
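Since you mention parallel processing over 400 GB, you can also download many objects at once from Python. A minimal sketch using a thread pool, assuming your objects share a common prefix; the bucket name, prefix, and destination directory below are placeholders:

import os
from concurrent.futures import ThreadPoolExecutor
from google.cloud import storage

def download_all(bucket_name, prefix, dest_dir, max_workers=8):
    """Download every object under `prefix` into `dest_dir`, several at a time."""
    client = storage.Client()
    # Skip "directory" placeholder objects, whose names end with '/'.
    blobs = [b for b in client.list_blobs(bucket_name, prefix=prefix)
             if not b.name.endswith('/')]

    def fetch(blob):
        # Keep only the object's base name for the local file.
        local_path = os.path.join(dest_dir, os.path.basename(blob.name))
        blob.download_to_filename(local_path)
        return blob.name

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for name in pool.map(fetch, blobs):
            print('Downloaded {}'.format(name))

download_all('your_bucket', 'your_prefix/', '/home/kylefoley')

This is the same idea as gsutil -m cp, which parallelizes the copy across multiple threads and processes.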
Also, regarding your other question: there is a theoretical maximum bandwidth of 2 Gbit/s (Gbps) as a cap for peak performance. You can speed up the process by using an SSD attached to your instance.
Upvotes: 2