adey27

Reputation: 469

How to convert S3 bucket content (.csv format) into a dataframe in AWS Lambda

I am trying to ingest S3 data (a CSV file) into RDS (MSSQL) through Lambda. Sample code:

import csv
import json
import boto3
from urllib.parse import unquote_plus

s3 = boto3.client('s3')

def lambda_handler(event, context):
    if event:
        file_obj = event["Records"][0]
        bucketname = str(file_obj["s3"]["bucket"]["name"])
        csv_filename = unquote_plus(str(file_obj["s3"]["object"]["key"]))
        print("Filename: ", csv_filename)
        csv_fileObj = s3.get_object(Bucket=bucketname, Key=csv_filename)
        file_content = csv_fileObj["Body"].read().decode("utf-8").split()

I have tried to put my CSV contents into a list, but it didn't work:

        results = []
        for row in csv.DictReader(file_content):
            results.append(row.values())
        print(results)
        print(file_content)
        return {
            'statusCode': 200,
            'body': json.dumps('S3 file processed')
        }

Is there any way I could convert "file_content" into a dataframe in Lambda? I have multiple columns to load.

Later, I would follow this approach to load the data into RDS:

import pyodbc
import pandas as pd

# Insert data from the CSV file into a dataframe (df)
server = 'yourservername'
database = 'AdventureWorks'
username = 'username'
password = 'yourpassword'
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=' + server + ';DATABASE=' + database +
                      ';UID=' + username + ';PWD=' + password)
cursor = cnxn.cursor()

# Insert the dataframe into SQL Server
for index, row in df.iterrows():
    cursor.execute(
        "INSERT INTO HumanResources.DepartmentTest (DepartmentID, Name, GroupName) VALUES (?, ?, ?)",
        row.DepartmentID, row.Name, row.GroupName
    )
cnxn.commit()
cursor.close()

Can anyone suggest how to go about it?

Upvotes: 0

Views: 2266

Answers (2)

John Rotenstein

Reputation: 269350

It appears that your goal is to load the contents of a CSV file from Amazon S3 into SQL Server.

You could do this without using Dataframes (a rough sketch follows the list below):

  • Loop through the Event Records (multiple can be passed in)
  • For each object:
    • Download the object to /tmp/
    • Use Python's csv reader to loop through the contents of the file
    • Generate INSERT statements to insert the data into the SQL Server table
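
A minimal sketch of that flow, reusing the placeholder connection details and the table/column names from your question (replace them with your own values):

import csv
import os
from urllib.parse import unquote_plus

import boto3
import pyodbc

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Placeholder connection string taken from the question
    cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=yourservername;'
                          'DATABASE=AdventureWorks;UID=username;PWD=yourpassword')
    cursor = cnxn.cursor()

    # Loop through the event records (multiple objects can be passed in)
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])

        # Download the object to /tmp/, the only writable path in Lambda
        local_path = os.path.join("/tmp", os.path.basename(key))
        s3.download_file(bucket, key, local_path)

        # Loop through the CSV and generate an INSERT per row
        with open(local_path, newline="") as f:
            for row in csv.DictReader(f):
                cursor.execute(
                    "INSERT INTO HumanResources.DepartmentTest (DepartmentID, Name, GroupName) VALUES (?, ?, ?)",
                    row["DepartmentID"], row["Name"], row["GroupName"]
                )

    cnxn.commit()
    cursor.close()
    return {"statusCode": 200, "body": "S3 file processed"}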

You might also consider using aws-data-wrangler: Pandas on AWS, which is available as a Lambda Layer.
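
For example, with that layer attached, reading the object into a dataframe is a one-liner (bucketname and csv_filename are the variables from your question):

import awswrangler as wr

# Read the CSV object from S3 straight into a pandas dataframe
df = wr.s3.read_csv(path=f"s3://{bucketname}/{csv_filename}")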

Upvotes: 1

Simon Hawe

Reputation: 4539

You can use io.BytesIO to get the bytes data into memory and, after that, use pandas.read_csv to transform it into a dataframe. Note that there is a strange SSL download limit that will lead to issues when downloading data larger than 2 GB. That is why I have used chunking in the code below.

import io
import pandas as pd

obj = s3.get_object(Bucket=bucketname, Key=csv_filename)
# This should prevent the 2GB download limit from a python ssl internal
chunks = (chunk for chunk in obj["Body"].iter_chunks(chunk_size=1024 ** 3))
data = io.BytesIO(b"".join(chunks))  # This keeps everything fully in memory
df = pd.read_csv(data)  # here you can also provide any necessary args and kwargs

Upvotes: 1
