datalifenyc

Reputation: 2248

What is the best approach to sync data from an AWS S3 bucket to Azure Data Lake Storage Gen2?

Currently, I download CSV files from AWS S3 to my local computer using: aws s3 sync s3://<cloud_source> c:/<local_destination> --profile aws_profile. Now, I would like to use the same process to sync the files from AWS to Azure Data Lake Storage Gen2 (one-way sync) on a daily basis. [Note: I only have read/download permissions for the S3 data source.]

I thought about 5 potential paths to solving this problem:

  1. Use AWS CLI commands within Azure. I'm not entirely sure how to do that without running an Azure VM. I would also need my AWS profile credentials to persist.
  2. Use Python's subprocess library to run AWS CLI commands. I run into similar issues as with option 1, namely a) maintaining a persistent install of the AWS CLI, b) passing AWS profile credentials, and c) running without an Azure VM (a rough sketch of this approach follows the list).
  3. Use Python's Boto3 library to access AWS services. In the past, Boto3 apparently didn't support the AWS sync command, so developers like @raydel-miranda built their own. [see Sync two buckets through boto3] However, it now appears that Boto3 has a DataSync class. [see DataSync | Boto3 Docs 1.17.27 documentation] Would I still need to run this in an Azure VM, or could I use Azure Data Factory?
  4. Use Azure Data Factory to copy data from the AWS S3 bucket. [see Copy data from Amazon Simple Storage Service by using Azure Data Factory] My concern is that I want to sync rather than copy. I believe Azure Data Factory can check whether a file already exists, but what if a file has been deleted from the AWS S3 data source?
  5. Use an Azure Data Science Virtual Machine to: a) install the AWS CLI, b) create my AWS profile to store the access credentials, and c) run the aws s3 sync... command.
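
For option 2, here is a minimal sketch of what I have in mind, assuming the AWS CLI is installed and the aws_profile credentials are configured on whatever compute ends up running it (a VM, Azure Batch node, or Databricks cluster); the bucket and destination paths are placeholders:

    import subprocess

    def sync_s3_to_destination(bucket: str, destination: str, profile: str = "aws_profile") -> None:
        """One-way sync from an S3 bucket to a local or mounted destination path."""
        subprocess.run(
            ["aws", "s3", "sync", f"s3://{bucket}", destination, "--profile", profile],
            check=True,  # raise CalledProcessError if the CLI exits non-zero
        )

    # Example call with placeholder names, e.g. a path mounted against the Data Lake.
    sync_s3_to_destination("<cloud_source>", "/mnt/<local_destination>")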

Any tips, suggestions, or ideas on automating this process are greatly appreciated.

Upvotes: 1

Views: 2523

Answers (2)

Vishwas Lele

Reputation: 46

AzReplicate is another option, especially for very large containers: https://learn.microsoft.com/en-us/samples/azure/azreplicate/azreplicate/

Upvotes: 0

Himanshu Kumar Sinha

Reputation: 1806

Adding one more to the list :) 6. Please also look into the AzCopy option: https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-s3?toc=/azure/storage/blobs/toc.json
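
Not the only way to run it, but here is a minimal sketch of driving AzCopy from Python on a schedule, assuming azcopy is installed on the compute that runs it, the read-only AWS keys are available, and the destination URL carries a SAS token; all bracketed names are placeholders:

    import os
    import subprocess

    # AzCopy reads the S3 source credentials from these environment variables.
    env = dict(
        os.environ,
        AWS_ACCESS_KEY_ID="<aws-access-key-id>",
        AWS_SECRET_ACCESS_KEY="<aws-secret-access-key>",
    )

    subprocess.run(
        [
            "azcopy", "copy",
            "https://s3.amazonaws.com/<bucket>/<prefix>",
            "https://<storage-account>.blob.core.windows.net/<container>/<path>?<sas-token>",
            "--recursive",
        ],
        env=env,
        check=True,
    )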

I am not aware of any tool which helps in syncing the data; more or less all of them will do a copy, so I think you will have to implement the sync logic yourself. A couple of quick thoughts:

#3) You can run this from a batch service and initiate it from Azure Data Factory. Also, since we are talking about Python, you could run it from Azure Databricks as well.
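
As a rough illustration of that copy step in Python (for example from Azure Batch, a Data Factory custom activity, or Databricks), assuming read-only S3 credentials in the aws_profile profile, an ADLS Gen2 account key, and the azure-storage-file-datalake package; bracketed names are placeholders:

    import boto3
    from azure.storage.filedatalake import DataLakeServiceClient

    # Read-only S3 client and the destination ADLS Gen2 file system (container).
    s3 = boto3.Session(profile_name="aws_profile").client("s3")
    adls = DataLakeServiceClient(
        account_url="https://<storage-account>.dfs.core.windows.net",
        credential="<storage-account-key>",
    )
    fs = adls.get_file_system_client("<container>")

    # Copy every object in the bucket, reusing the S3 key as the Data Lake file path.
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket="<cloud_source>"):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket="<cloud_source>", Key=obj["Key"])["Body"].read()
            fs.get_file_client(obj["Key"]).upload_data(body, overwrite=True)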

#4) ADF does not have any built-in sync logic to handle files that have been deleted at the source; you can implement that yourself using the Get Metadata activity: https://learn.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity
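
If you do the comparison in Python instead of ADF, the delete half of the sync could look roughly like this, reusing the s3 and fs clients from the copy sketch above and assuming the files were written under their original S3 keys:

    # Collect the keys that currently exist in S3.
    s3_keys = set()
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket="<cloud_source>"):
        s3_keys.update(obj["Key"] for obj in page.get("Contents", []))

    # Delete any Data Lake file whose source object has been removed from S3.
    for path in fs.get_paths(recursive=True):
        if not path.is_directory and path.name not in s3_keys:
            fs.delete_file(path.name)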

Upvotes: 0
