Reputation: 21
I'm having trouble writing a new CSV file to an S3 bucket. I want to read a CSV file that I have in an S3 bucket, and if one of the values in the CSV fits a certain requirement, change it to a different value. I've read that it's not possible to edit an S3 object in place, so I need to create a new one every time. In short, I want to create a new, updated CSV file from another CSV file in an S3 bucket, with the changes applied.
I'm trying to use DictReader and DictWriter. I can read the CSV file properly, but whenever I try to write the updated version I run into a different DictWriter issue. My current code is below.
# Function to be pasted into AWS Lambda.
# Accesses the S3 bucket, opens the CSV file, and receives the response line by line.
# To be able to access S3 buckets and the objects within the bucket
import boto3
# To be able to read the CSV by using DictReader
import csv

# Lambda script that extracts, transforms, and loads data from S3 bucket 'testing-bucket-1042' and CSV file 'Insurance.csv'
def lambda_handler(event, context):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('testing-bucket-1042')
    obj = bucket.Object(key='Insurance.csv')
    response = obj.get()
    lines = response['Body'].read().decode('utf-8').split()
    reader = csv.DictReader(lines)
    with open("s3://testing-bucket-1042/Insurance.csv", newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        fieldnames = ['county', 'eq_site_limit']
        writer = csv.DictWriter(lines, fieldnames=fieldnames)
        for row in reader:
            writer.writeheader()
            if row['county'] == "CLAY":  # if the row's 'county' column contains the string "CLAY"
                writer.writerow({'county': 'CHANGED'})
            if row['eq_site_limit'] == "0":  # if the row's 'eq_site_limit' column contains the string "0"
                writer.writerow({'eq_site_limit': '9000'})
Right now, the error that I am getting is that the path I use when attempting to open the CSV, "s3://testing-bucket-1042/Insurance.csv", is said to not exist.
The error says
"errorMessage": "[Errno 2] No such file or directory: 's3://testing-bucket-1042/Insurance.csv'", "errorType": "FileNotFoundError"
What would be the correct way to use DictWriter, if at all?
Upvotes: 2
Views: 3717
Reputation: 3364
First of all, s3:// is not a file protocol that Python's built-in open() understands, which is why you get that error message. It is good that you stated your intentions.
Okay, I refactored your code:
import codecs
# To be able to read the CSV by using DictReader
import csv
from io import StringIO

import boto3

# Lambda script that extracts, transforms, and loads data from S3 bucket 'testing-bucket-1042' and CSV file 'Insurance.csv'
def lambda_handler(event, context):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('testing-bucket-1042')
    obj = bucket.Object(key='Insurance.csv')
    stream = codecs.getreader('utf-8')(obj.get()['Body'])
    lines = list(csv.DictReader(stream))
    ### now you have your object there
    csv_buffer = StringIO()
    out = csv.DictWriter(csv_buffer, fieldnames=['county', 'eq_site_limit'])
    out.writeheader()
    for row in lines:
        if row['county'] == "CLAY":
            out.writerow({'county': 'CHANGED'})
        if row['eq_site_limit'] == "0":
            out.writerow({'eq_site_limit': '9000'})
    ### now write the content into some different bucket/key
    s3client = boto3.client('s3')
    s3client.put_object(Body=csv_buffer.getvalue().encode('utf-8'),
                        Bucket=..., Key=...)  # replace ... with your target bucket and key
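One caveat with the loop above: csv.DictWriter.writerow() fills any missing columns with restval (an empty string by default), so each written row keeps only the changed field and drops the rest. Here is a minimal sketch that rewrites complete rows instead; transform is a hypothetical helper name, and it assumes the same 'county' and 'eq_site_limit' columns from the question:

```python
import csv
from io import StringIO

def transform(stream, out_buffer):
    """Copy CSV rows from stream to out_buffer, preserving every column."""
    reader = csv.DictReader(stream)
    writer = csv.DictWriter(out_buffer, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # Update the matching fields in place, then write the whole row
        if row['county'] == 'CLAY':
            row['county'] = 'CHANGED'
        if row['eq_site_limit'] == '0':
            row['eq_site_limit'] = '9000'
        writer.writerow(row)
```

The resulting buffer can then be uploaded with put_object() in the same way.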
I hope that this works. Basically there are a few tricks:
- codecs to stream the CSV data directly from the S3 bucket
- StringIO to create an in-memory stream that csv.DictWriter can write to
- the S3 client's put_object method (as documented in AWS)
Upvotes: 2
Reputation: 971
You can use the streaming functionality of S3 to make changes on the fly. It is well suited to text manipulation tools such as awk and sed.
Example:
aws s3 cp s3://bucketname/file.csv - | sed 's/foo/bar/g' | aws s3 cp - s3://bucketname/new-file.csv
AWS Docs: https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html
Upvotes: 0
Reputation: 269490
To logically separate AWS code from business logic, I normally recommend this approach:
- Download the source file from Amazon S3 to the /tmp directory
- Process the file, writing the output to a new local file
- Upload the resulting file to Amazon S3
Using download_file() and upload_file() avoids having to worry about in-memory streams. It means you can take logic that normally operates on files (eg on your own computer) and then apply it to files obtained from S3.
It comes down to personal preference.
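A minimal sketch of that approach, assuming the bucket and key from the question and a hypothetical output key Insurance-updated.csv; the file transform is factored into its own function so it can be run and tested without AWS:

```python
import csv

def transform_csv(src_path, dst_path):
    """Plain local-file logic: read, update matching values, write a new file."""
    with open(src_path, newline='') as src, open(dst_path, 'w', newline='') as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row['county'] == 'CLAY':
                row['county'] = 'CHANGED'
            if row['eq_site_limit'] == '0':
                row['eq_site_limit'] = '9000'
            writer.writerow(row)

def lambda_handler(event, context):
    import boto3  # imported here so transform_csv stays usable without AWS
    s3 = boto3.client('s3')
    # /tmp is the only writable directory inside a Lambda environment
    s3.download_file('testing-bucket-1042', 'Insurance.csv', '/tmp/Insurance.csv')
    transform_csv('/tmp/Insurance.csv', '/tmp/Insurance-updated.csv')
    s3.upload_file('/tmp/Insurance-updated.csv', 'testing-bucket-1042',
                   'Insurance-updated.csv')
```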
Upvotes: 0