Reputation: 2248
Currently, I download CSV files from AWS S3 to my local computer using:
aws s3 sync s3://<cloud_source> c:/<local_destination> --profile aws_profile
Now, I would like to use the same process to sync the files from AWS to Azure Data Lake Storage Gen2 (one-way sync) on a daily basis. [Note: I only have read/download permissions for the S3 data source.]
I thought about 5 potential paths to solving this problem:
1. Run the aws s3 sync command (as above) on a schedule.
2. Use Python's subprocess library to run AWS CLI commands. I run into similar issues as option 1, namely a) maintaining a persistent install of AWS CLI, b) passing AWS profile credentials, and c) running without an Azure VM.
3. Use Python and Boto3. Boto3 does not have a sync command, so developers like @raydel-miranda developed their own. [see Sync two buckets through boto3] However, it now appears that there is a DataSync class for Boto3. [see DataSync | Boto3 Docs 1.17.27 documentation] Would I still need to run this in an Azure VM, or could I use Azure Data Factory? (A rough sketch of the hand-rolled approach follows this question.)
4. Use Azure Data Factory. However, I have not found an ADF equivalent of the aws s3 sync... command.
5. […]
Any tips, suggestions, or ideas on automating this process are greatly appreciated.
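For illustration, here is a minimal sketch of the hand-rolled version of option 3: a one-way copy from S3 into ADLS Gen2 with boto3 and the azure-storage-file-datalake SDK. The bucket, storage account, container, and credential values are placeholders, and a real sync would also need change detection and deletion handling:

# Hypothetical one-way S3 -> ADLS Gen2 copy (option 3 sketch; all names are placeholders)
import boto3
from azure.storage.filedatalake import DataLakeServiceClient

# Read-only S3 access through the existing AWS profile
s3 = boto3.Session(profile_name="aws_profile").client("s3")

# ADLS Gen2 destination (account key shown for brevity; a SAS or AAD credential also works)
adls = DataLakeServiceClient(
    account_url="https://<storage_account>.dfs.core.windows.net",
    credential="<account_key>",
)
fs = adls.get_file_system_client("<container>")

# Walk every object in the bucket and upload it under the same key
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="<cloud_source>"):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket="<cloud_source>", Key=obj["Key"])["Body"].read()
        fs.get_file_client(obj["Key"]).upload_data(body, overwrite=True)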
Upvotes: 1
Views: 2523
Reputation: 46
AzReplicate is another option, especially for very large containers: https://learn.microsoft.com/en-us/samples/azure/azreplicate/azreplicate/
Upvotes: 0
Reputation: 1806
Adding one more to the list :) 6. Please do also look into the AzCopy option: https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-s3?toc=/azure/storage/blobs/toc.json
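Per that doc, the copy itself can be a single command; the bucket, storage account, and container names below are placeholders, the destination needs authorization via azcopy login or a SAS token appended to the URL, and note this is a copy rather than a sync:

export AWS_ACCESS_KEY_ID=<access_key>
export AWS_SECRET_ACCESS_KEY=<secret_key>
azcopy copy 'https://s3.amazonaws.com/<bucket>' 'https://<storage_account>.blob.core.windows.net/<container>' --recursive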
I am not aware of any tool that helps in syncing the data; more or less all of them will do a copy, so I think you will have to implement the sync yourself. A couple of quick thoughts:
#3) You can run this from a batch service, which you can initiate from Azure Data Factory. Also, since we are talking about Python, you could run it from Azure Databricks as well.
#4) ADF does not have any sync logic for the files to be deleted. You can implement that using the Get Metadata activity: https://learn.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity
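If the pipeline ends up in Python instead (e.g., Databricks, per #3), the deletion pass is the same idea in a few lines; this sketch reuses the hypothetical s3 and fs clients from the option-3 sketch in the question:

# Delete destination files whose keys no longer exist in the source bucket
source_keys = set()
for page in s3.get_paginator("list_objects_v2").paginate(Bucket="<cloud_source>"):
    source_keys.update(obj["Key"] for obj in page.get("Contents", []))

for path in fs.get_paths(recursive=True):
    if not path.is_directory and path.name not in source_keys:
        fs.delete_file(path.name)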
Upvotes: 0