Reputation: 21
Currently I am using the code below, but it is taking too much time: I am converting the Dask dataframe to an in-memory buffer and then using a multipart upload to send it to S3.
import io
import boto3
from boto3.s3.transfer import TransferConfig

def multi_part_upload_with_s3(file_buffer_obj, BUCKET_NAME, key_path):
    s3 = boto3.resource('s3')
    config = TransferConfig(multipart_threshold=1024 * 25, max_concurrency=10, multipart_chunksize=1024 * 25, use_threads=True)
    # upload the whole in-memory buffer as a single multipart upload
    s3.meta.client.upload_fileobj(file_buffer_obj, BUCKET_NAME, key_path, Config=config)

# materialise the full Dask dataframe into an in-memory CSV, then re-encode it as bytes
target_buffer_old = io.StringIO()
ddf.compute().to_csv(target_buffer_old, sep=",")
target_buffer_old = io.BytesIO(target_buffer_old.getvalue().encode())
multi_part_upload_with_s3(target_buffer_old, "bucket", "key/file.csv")
Upvotes: 2
Views: 778
Reputation: 28683
I advise you to write separate S3 files in parallel using Dask (which is its default way of working) and then use a multipart upload to merge the outputs into a single object. You could use the s3fs method merge to do this. Note that you will want to write the parts without headers.
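For illustration, a minimal sketch of that approach, assuming ddf is the Dask dataframe from the question and that the bucket and key names are placeholders for your own; credentials come from your default AWS configuration:

import s3fs

# 1. Each Dask partition writes its own CSV part to S3 in parallel.
#    header=False so the merged file does not contain repeated header rows.
paths = ddf.to_csv(
    "s3://bucket/key/part-*.csv",   # placeholder prefix; "*" becomes the partition number
    index=False,
    header=False,
)

# 2. Stitch the parts into one object with a server-side multipart copy;
#    nothing is downloaded to or uploaded from the client.
fs = s3fs.S3FileSystem()
fs.merge("bucket/key/file.csv", paths)

# 3. Optionally remove the intermediate part files.
fs.rm(paths)

One caveat: S3 requires every part of a multipart copy except the last to be at least 5 MB, so this only works when the individual partition files are large enough.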
Upvotes: 1