knl

Reputation: 1041

Ingest data into Blob Storage from downloadable URLs without having to download the files

I'm trying to ingest the Wikipedia dumps at https://dumps.wikimedia.org/enwiki/20201001/ into Azure Blob Storage using Python.

The files are around 200-300 MB each, but there are so many of them that the total size is more than 50 GB.

I don't want to jeopardize my laptop's storage, so I'd rather not download the files to the local drive and then upload them to Blob Storage.

Is there any option to stream the files from the URLs directly to Blob Storage?

Upvotes: 0

Views: 177

Answers (2)

Ivan Glasenberg

Reputation: 29940

If you're using the package azure-storage-blob 12.5.0, you can use the start_copy_from_url method directly. Note that the method copies one blob at a time, so you need to call it once per file.

Here is the sample code:

from azure.storage.blob import BlobServiceClient

CONNECTION_STRING = "DefaultEndpointsProtocol=https;AccountName=xxx;AccountKey=xxx;EndpointSuffix=core.windows.net"

def run_sample():
    blob_service_client = BlobServiceClient.from_connection_string(CONNECTION_STRING)
    source_url = "http://www.gutenberg.org/files/59466/59466-0.txt"
    copied_blob = blob_service_client.get_blob_client("your_container_name", "59466-0.txt")

    # Note: the call returns as soon as the copy is scheduled on the service
    # side; check the copy status afterwards, as in the sketch below and in
    # the official doc.
    copied_blob.start_copy_from_url(source_url)

if __name__ == "__main__":
    run_sample()
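
Since start_copy_from_url returns as soon as the copy is scheduled, and the question involves many files, a loop that schedules each copy and then polls its status is the natural shape. Here is a rough sketch of that; the container name is a placeholder and the URL list is hypothetical, since in practice you would scrape the file links from the dump index page.

import time

from azure.storage.blob import BlobServiceClient

CONNECTION_STRING = "DefaultEndpointsProtocol=https;AccountName=xxx;AccountKey=xxx;EndpointSuffix=core.windows.net"

# Hypothetical URL list; in practice, scrape the links from
# https://dumps.wikimedia.org/enwiki/20201001/ instead of hard-coding them.
SOURCE_URLS = [
    "https://dumps.wikimedia.org/enwiki/20201001/enwiki-20201001-pages-articles-multistream.xml.bz2",
]

def copy_and_wait(blob_client, source_url, poll_seconds=10):
    # Schedule the server-side copy; Azure pulls the data itself, so
    # nothing is downloaded to the local machine.
    blob_client.start_copy_from_url(source_url)

    # Poll until the service reports a terminal state.
    while True:
        copy_props = blob_client.get_blob_properties().copy
        if copy_props.status != "pending":
            return copy_props.status  # "success", "failed" or "aborted"
        time.sleep(poll_seconds)

def main():
    service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
    container = service.get_container_client("your_container_name")
    for url in SOURCE_URLS:
        blob_name = url.rsplit("/", 1)[-1]  # keep the original file name
        status = copy_and_wait(container.get_blob_client(blob_name), url)
        print(f"{blob_name}: {status}")

if __name__ == "__main__":
    main()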

For more details, please refer to the complete sample on GitHub.

Upvotes: 0

Tom W

Reputation: 5403

You could create an Azure Data Factory pipeline: Data Factory supports HTTP/REST APIs as a source type and Blob Storage as a sink, so the copy runs entirely within Azure without touching your local disk.
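
To make this concrete, here is a rough sketch using the azure-mgmt-datafactory management SDK, assuming an existing factory; the model names follow the ADF Python quickstart and may differ between SDK versions, and every resource name, path and connection string below is a placeholder. For binary files like these dumps, the HTTP connector is the usual fit (the REST connector targets JSON APIs).

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobDataset, AzureBlobStorageLinkedService, BlobSink, CopyActivity,
    DatasetReference, DatasetResource, HttpDataset, HttpLinkedService,
    HttpSource, LinkedServiceReference, LinkedServiceResource, PipelineResource,
)

SUBSCRIPTION_ID = "<subscription-id>"  # placeholders throughout
RG, FACTORY = "<resource-group>", "<existing-factory-name>"

adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Linked service for the HTTP source (the dump site needs no authentication).
adf.linked_services.create_or_update(RG, FACTORY, "WikiDumpsHttp", LinkedServiceResource(
    properties=HttpLinkedService(url="https://dumps.wikimedia.org/",
                                 authentication_type="Anonymous")))

# Linked service for the Blob Storage sink.
adf.linked_services.create_or_update(RG, FACTORY, "BlobStore", LinkedServiceResource(
    properties=AzureBlobStorageLinkedService(
        connection_string="<storage-connection-string>")))

# Source dataset: one dump file, addressed relative to the linked service URL.
adf.datasets.create_or_update(RG, FACTORY, "DumpFile", DatasetResource(
    properties=HttpDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="WikiDumpsHttp"),
        relative_url="enwiki/20201001/enwiki-20201001-pages-articles-multistream.xml.bz2")))

# Sink dataset: the target folder in Blob Storage.
adf.datasets.create_or_update(RG, FACTORY, "DumpBlob", DatasetResource(
    properties=AzureBlobDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="BlobStore"),
        folder_path="wikidumps")))

# Pipeline with a single copy activity: HTTP source -> blob sink.
adf.pipelines.create_or_update(RG, FACTORY, "CopyWikiDump", PipelineResource(
    activities=[CopyActivity(
        name="CopyDumpFile",
        inputs=[DatasetReference(type="DatasetReference", reference_name="DumpFile")],
        outputs=[DatasetReference(type="DatasetReference", reference_name="DumpBlob")],
        source=HttpSource(),
        sink=BlobSink())]))

run = adf.pipelines.create_run(RG, FACTORY, "CopyWikiDump", parameters={})
print(run.run_id)

Each run copies one file; to cover all the dumps you could parameterize relative_url and trigger one run per file, or build the same pipeline interactively with the portal's Copy Data tool.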

Upvotes: 0
