jz22

Reputation: 2638

How to manipulate files stored in S3 without saving them to the server?

I have the following Python script that downloads two files from an S3-compatible service, merges them, and uploads the output to another bucket.

import time
import boto3
import pandas as pd

timestamp = int(time.time())

conn = boto3.client('s3')
conn.download_file('segment', 'segment.csv', 'segment.csv')
conn.download_file('payment', 'payments.csv', 'payments.csv')

paymentsfile = 'payments.csv'
segmentsfile = 'segment.csv'
outputfile = 'payments_merged_' + str(timestamp) + '.csv'

csv_payments = pd.read_csv(paymentsfile, dtype={'ID': float})
csv_segments = pd.read_csv(segmentsfile, dtype={'ID': float})
csv_payments = csv_payments.merge(csv_segments, on='ID')
open(outputfile, 'a').close()
csv_payments.to_csv(outputfile)

conn.upload_file(outputfile, 'backup', outputfile)

However, when I execute the script it stores the files in the folder the script lives in. For security reasons I would like to prevent this from happening. I could delete the files after the script has run, but let's assume my script is located in the folder /app/script/. That means that for a short time, while the script is executing, someone could open the URL example.com/app/script/payments.csv and download the file. What is a good solution for that?

Upvotes: 0

Views: 1047

Answers (2)

mootmoot

Reputation: 13166

In fact, pandas.read_csv lets you read from a buffer or bytes object, so you can do everything in memory. You can run this script on an EC2 instance, or even better, as an AWS Lambda function if the files are small.

import io
import time
import boto3
import pandas as pd

timestamp = int(time.time())

paymentsfile = 'payments.csv'
segmentsfile = 'segment.csv'
outputfile = 'payments_merged_' + str(timestamp) + '.csv'

s3 = boto3.client('s3')
payment_obj = s3.get_object(Bucket='payment', Key=paymentsfile)
segment_obj = s3.get_object(Bucket='segment', Key=segmentsfile)

# read_csv accepts the streaming body returned by get_object directly
csv_payments = pd.read_csv(payment_obj['Body'], dtype={'ID': float})
csv_segments = pd.read_csv(segment_obj['Body'], dtype={'ID': float})
csv_merge = csv_payments.merge(csv_segments, on='ID')

# write the merged CSV to an in-memory buffer instead of a local file
buffer = io.BytesIO(csv_merge.to_csv().encode('utf-8'))

s3.upload_fileobj(buffer, 'bucket_name', outputfile)
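
Alternatively, if you don't want to manage a file-like object at all, a minimal sketch using put_object (the bucket name here is a placeholder) uploads the CSV in one call:

# to_csv() with no path returns the CSV as a string; encode it and upload directly
s3.put_object(Bucket='bucket_name', Key=outputfile,
              Body=csv_merge.to_csv().encode('utf-8'))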

Upvotes: 1

BryceH

Reputation: 2798

The simplest way would be to modify your web server's configuration so that it does not serve the directory you are writing to, or to write to a directory that isn't served at all. For example, a common practice is to use /scr for this type of thing. You would also need to modify permissions so that the user your web server runs under has access to /scr.

To restrict web server access to the directory you write to, you can use the approach described here for Nginx:

https://serverfault.com/questions/137907/how-to-restrict-access-to-directory-and-subdirs

For Apache you can use this example:

https://serverfault.com/questions/174708/apache2-how-do-i-restrict-access-to-a-directory-but-allow-access-to-one-file-w
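
As a minimal sketch of the Nginx approach, assuming the web root maps /app/script/ to the directory the script writes to (the path is a placeholder for your setup):

# inside the server block: deny all HTTP access to the script's directory
location /app/script/ {
    deny all;
}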

Upvotes: 1
