John Gort

Reputation: 21

How can I read a CSV file from an S3 bucket, apply certain if-statements to it, and write a new, updated CSV file back to the S3 bucket?

I'm having trouble writing a new CSV file to an S3 bucket. I want to be able to read a CSV file that I have in an S3 bucket, and if one of the values in the CSV fits a certain requirement, I want to change it to a different value. I've read that it's not possible to edit an S3 object in place, so I need to create a new one every time. In short, I want to create a new, updated CSV file from another CSV file in an S3 bucket, with changes applied.

I'm trying to use DictWriter and DictReader, but I always run into issues with DictWriter. I can read the CSV file properly, but when I try to update it, I run into a myriad of significantly different DictWriter issues. The issue I'm getting right now is described below the code.

# Function to be pasted into AWS Lambda.
# Accesses the S3 bucket, opens the CSV file, and reads the response line by line.

# To be able to access S3 buckets and the objects within the bucket
import boto3

# To be able to read the CSV by using DictReader 
import csv

# Lambda script that extracts, transforms, and loads data from S3 bucket 'testing-bucket-1042' and CSV file 'Insurance.csv'

def lambda_handler(event, context):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('testing-bucket-1042')
    obj = bucket.Object(key = 'Insurance.csv')
    response = obj.get()
    lines = response['Body'].read().decode('utf-8').split()

    reader = csv.DictReader(lines) 

    with open("s3://testing-bucket-1042/Insurance.csv", newline = '') as csvfile:
            reader = csv.DictReader(csvfile)
            fieldnames = ['county', 'eq_site_limit'] 
            writer = csv.DictWriter(lines, fieldnames=fieldnames)

            for row in reader: 
                writer.writeheader()
                if row['county'] == "CLAY": # if the row is under the column 'county', and contains the string "CLAY"
                    writer.writerow({'county': 'CHANGED'})
                if row['eq_site_limit'] == "0": # if the row is under the column 'eq_site_limit', and contains the string "0"
                    writer.writerow({'eq_site_limit': '9000'})

Right now, the error I am getting says that the path I use when attempting to open the CSV, "s3://testing-bucket-1042/Insurance.csv", does not exist.

The error says

"errorMessage": "[Errno 2] No such file or directory: 's3://testing-bucket-1042/Insurance.csv'", "errorType": "FileNotFoundError"

What would be the correct way to use DictWriter, if at all?

Upvotes: 2

Views: 3717

Answers (3)

Drey

Reputation: 3364

First of all, s3:// is not a regular file protocol, so open() cannot resolve that path, and that is why you get your error message. It is good that you stated your intentions.

Okay, I refactored your code:

import codecs

import boto3

# To be able to read the CSV by using DictReader
import csv
from io import StringIO

# Lambda script that extracts, transforms, and loads data from S3 bucket 'testing-bucket-1042' and CSV file 'Insurance.csv'

def lambda_handler(event, context):
    s3 = boto3.resource('s3')
    bucket = s3.Bucket('testing-bucket-1042')
    obj = bucket.Object(key = 'Insurance.csv')
    stream = codecs.getreader('utf-8')(obj.get()['Body'])
    lines = list(csv.DictReader(stream))
    ### now you have your object there

    csv_buffer = StringIO()
    out = csv.DictWriter(csv_buffer, fieldnames=['county', 'eq_site_limit'])
    out.writeheader()  # write the header row once, before the data rows

    for row in lines:
        if row['county'] == "CLAY":  
            out.writerow({'county': 'CHANGED'})
        if row['eq_site_limit'] == "0": 
            out.writerow({'eq_site_limit': '9000'})

    ### now write content into some different bucket/key

    s3client = boto3.client('s3')
    s3client.put_object(Body=csv_buffer.getvalue().encode('utf-8'),
                        Bucket=target_bucket,  # placeholder: your destination bucket
                        Key=target_key)        # placeholder: your destination key

I hope that this works. Basically, there are a few tricks:

  • use codecs to stream the CSV data directly from the S3 object
  • use StringIO to create an in-memory stream that csv.DictWriter can write to
  • when you are finished, one way to "upload" your content is through the S3 client's put_object method (as documented by AWS)

Upvotes: 2

Varun Chandak

Reputation: 971

You can use the AWS CLI's ability to stream S3 objects to and from standard input/output to make changes on the fly. This approach is best suited to simple text manipulation with tools such as awk and sed.

Example:

aws s3 cp s3://bucketname/file.csv - | sed 's/foo/bar/g' | aws s3 cp -  s3://bucketname/new-file.csv
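Applied to the question's file, a global substitution could look like the line below. Note this is only a sketch: sed is not CSV-aware, so it replaces the text wherever it appears in the file, and the output key Insurance-updated.csv is just an example name.

aws s3 cp s3://testing-bucket-1042/Insurance.csv - | sed 's/CLAY/CHANGED/g' | aws s3 cp - s3://testing-bucket-1042/Insurance-updated.csv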

AWS Docs: https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html

Upvotes: 0

John Rotenstein

Reputation: 269490

To logically separate AWS code from business logic, I normally recommend this approach:

  • Download the object from Amazon S3 to the /tmp directory
  • Perform desired business logic (read file, write file)
  • Upload the resulting file to Amazon S3

Using download_file() and upload_file() avoids having to worry about in-memory streams. It means you can take logic that normally operates on files (eg on your own computer) and apply it unchanged to files obtained from S3.
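A minimal sketch of that pattern, reusing the bucket and key from the question (the output key Insurance_updated.csv is a hypothetical name):

import csv
import boto3

def lambda_handler(event, context):
    s3 = boto3.client('s3')
    local_in = '/tmp/Insurance.csv'            # /tmp is Lambda's writable scratch space
    local_out = '/tmp/Insurance_updated.csv'

    # 1. Download the object from Amazon S3 to local disk
    s3.download_file('testing-bucket-1042', 'Insurance.csv', local_in)

    # 2. Plain file-based CSV logic, as you would write it on your own computer
    with open(local_in, newline='') as src, open(local_out, 'w', newline='') as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if row['county'] == 'CLAY':
                row['county'] = 'CHANGED'
            if row['eq_site_limit'] == '0':
                row['eq_site_limit'] = '9000'
            writer.writerow(row)  # write the full row, with any changes applied

    # 3. Upload the resulting file back to Amazon S3
    s3.upload_file(local_out, 'testing-bucket-1042', 'Insurance_updated.csv')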

It comes down to personal preference.

Upvotes: 0
