jz22

Reputation: 2638

How to manipulate files stored in S3 without saving them to the server?

I have the following Python script that downloads two files from an S3-compatible service, merges them, and uploads the output to another bucket.

import time
import boto3
import pandas as pd

timestamp = int(time.time())

conn = boto3.client('s3')
conn.download_file('segment', 'segment.csv', 'segment.csv')
conn.download_file('payment', 'payments.csv', 'payments.csv')

paymentsfile = 'payments.csv'
segmentsfile = 'segment.csv'
outputfile = 'payments_merged_' + str(timestamp) + '.csv'

csv_payments = pd.read_csv(paymentsfile, dtype={'ID': float})
csv_segments = pd.read_csv(segmentsfile, dtype={'ID': float})
csv_payments = csv_payments.merge(csv_segments, on='ID')
open(outputfile, 'a').close()
csv_payments.to_csv(outputfile)

conn.upload_file(outputfile, 'backup', outputfile)

However, when I execute the script it stores the files in the folder the script lives in. For security reasons I would like to prevent this from happening. I could delete the files after the script has run, but let's assume my script is located in the folder /app/script/. That means that for a short time, while the script is executing, someone could open the URL example.com/app/script/payments.csv and download the file. What is a good solution for that?

Upvotes: 0

Views: 1047

Answers (2)

mootmoot

Reputation: 13166

In fact, pandas.read_csv lets you read from a buffer or bytes object, so you can do everything in memory. You can run this script on an EC2 instance, or even better, as an AWS Lambda function if the files are small.

import io
import time
import boto3
import pandas as pd

timestamp = int(time.time())

paymentsfile = 'payments.csv'
segmentsfile = 'segment.csv'
outputfile = 'payments_merged_' + str(timestamp) + '.csv'

s3 = boto3.client('s3')
payment_obj = s3.get_object(Bucket='payment', Key=paymentsfile)
segment_obj = s3.get_object(Bucket='segment', Key=segmentsfile)

# read_csv accepts the streaming body returned by get_object directly
csv_payments = pd.read_csv(payment_obj['Body'], dtype={'ID': float})
csv_segments = pd.read_csv(segment_obj['Body'], dtype={'ID': float})
csv_merge = csv_payments.merge(csv_segments, on='ID')

# write the merged CSV to an in-memory buffer instead of a local file
buffer = io.BytesIO(csv_merge.to_csv().encode('utf-8'))

s3.upload_fileobj(buffer, 'bucket_name', outputfile)
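
Alternatively, if you don't want to manage a file-like object at all, a minimal sketch using put_object (the bucket name here is a placeholder) uploads the CSV in one call:

# to_csv() with no path returns the CSV as a string; encode it and upload directly
s3.put_object(Bucket='bucket_name', Key=outputfile,
              Body=csv_merge.to_csv().encode('utf-8'))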

Upvotes: 1

BryceH

Reputation: 2798

The simplest way would be to modify your web server's configuration so that it does not serve the directory you are writing to, or to write to a directory that isn't served at all. For example, a common practice is to use /scr for this type of thing. You would also need to modify permissions so that the user your web server runs under has access to /scr.

To restrict web server access to the directory you write to, you can use the approach described here for Nginx:

https://serverfault.com/questions/137907/how-to-restrict-access-to-directory-and-subdirs

For Apache you can use this example:

https://serverfault.com/questions/174708/apache2-how-do-i-restrict-access-to-a-directory-but-allow-access-to-one-file-w
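
As a minimal sketch of the Nginx approach, assuming the web root maps /app/script/ to the directory the script writes to (the path is a placeholder for your setup):

# inside the server block: deny all HTTP access to the script's directory
location /app/script/ {
    deny all;
}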

Upvotes: 1
