Reputation: 1041
I'm trying to ingest data from https://dumps.wikimedia.org/enwiki/20201001/
(the Wikipedia dumps) into Azure Blob Storage using Python.
The files are around 200-300 MB each, but there are so many of them that the total size is more than 50 GB.
I don't want to strain my laptop's local storage, so I'd rather not download the files to the local drive and then upload them to Blob Storage.
Is there any option to stream the files from the URLs to Blob Storage directly?
Upvotes: 0
Views: 177
Reputation: 29940
If you're using the package azure-storage-blob 12.5.0, you can use the start_copy_from_url
method directly. Note that you need to call this method once per file.
Here is the sample code:
from azure.storage.blob import BlobServiceClient

CONNECTION_STRING = "DefaultEndpointsProtocol=https;AccountName=xxx;AccountKey=xxx;EndpointSuffix=core.windows.net"

def run_sample():
    blob_service_client = BlobServiceClient.from_connection_string(CONNECTION_STRING)
    source_blob = "http://www.gutenberg.org/files/59466/59466-0.txt"
    copied_blob = blob_service_client.get_blob_client("your_container_name", '59466-0.txt')
    # Note: the method returns immediately while the copy is still in progress;
    # you need to check the copy status as per the official doc mentioned below.
    copied_blob.start_copy_from_url(source_blob)

if __name__ == "__main__":
    run_sample()
For more details, please refer to the complete sample on GitHub.
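As a rough sketch of how this could scale to the asker's many dump files, the loop below starts one server-side copy per URL and then polls the copy status via get_blob_properties(). The container name and the URL list are placeholders you would fill in yourself.

import time
from azure.storage.blob import BlobServiceClient

CONNECTION_STRING = "DefaultEndpointsProtocol=https;AccountName=xxx;AccountKey=xxx;EndpointSuffix=core.windows.net"
CONTAINER = "your_container_name"  # placeholder container name
SOURCE_URLS = [
    # placeholder: fill in the actual dump file URLs from the index page
    "https://dumps.wikimedia.org/enwiki/20201001/example-dump-file.bz2",
]

def copy_all():
    service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
    blobs = []
    for url in SOURCE_URLS:
        blob_name = url.rsplit("/", 1)[-1]
        blob = service.get_blob_client(CONTAINER, blob_name)
        # Server-side copy: Azure pulls the data from the URL itself,
        # so nothing is downloaded to the local machine.
        blob.start_copy_from_url(url)
        blobs.append(blob)

    # Poll until every server-side copy has finished.
    for blob in blobs:
        props = blob.get_blob_properties()
        while props.copy.status == "pending":
            time.sleep(10)
            props = blob.get_blob_properties()
        print(blob.blob_name, props.copy.status)

if __name__ == "__main__":
    copy_all()

Since the copy runs entirely inside Azure, this avoids touching the local disk, which is what the question is asking for.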
Upvotes: 0
Reputation: 5403
You could create an Azure Data Factory pipeline: Data Factory supports REST API as a source type and Blob Storage as a sink.
Upvotes: 0