adey27

Reputation: 469

How to convert S3 bucket content (.csv format) into a dataframe in AWS Lambda

I am trying to ingest S3 data (a CSV file) into RDS (MSSQL) through Lambda. Sample code:

import csv
import json
import boto3
from urllib.parse import unquote_plus

s3 = boto3.client('s3')

def lambda_handler(event, context):
    if event:
        file_obj = event["Records"][0]
        bucketname = str(file_obj["s3"]["bucket"]["name"])
        csv_filename = unquote_plus(str(file_obj["s3"]["object"]["key"]))
        print("Filename: ", csv_filename)
        csv_fileObj = s3.get_object(Bucket=bucketname, Key=csv_filename)
        file_content = csv_fileObj["Body"].read().decode("utf-8").split()

I have tried to put my CSV contents into a list, but it didn't work:

        results = []
        for row in csv.DictReader(file_content):
            results.append(row.values())
        print(results)
        print(file_content)
        return {
            'statusCode': 200,
            'body': json.dumps('S3 file processed')
        }

Is there any way I could convert "file_content" into a dataframe in Lambda? I have multiple columns to load.

Later, I would follow this approach to load the data into RDS:

import pyodbc
import pandas as pd

# Insert data from the CSV file into a dataframe (df)
server = 'yourservername'
database = 'AdventureWorks'
username = 'username'
password = 'yourpassword'
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=' + server + ';DATABASE=' + database +
                      ';UID=' + username + ';PWD=' + password)
cursor = cnxn.cursor()

# Insert the dataframe into SQL Server
for index, row in df.iterrows():
    cursor.execute(
        "INSERT INTO HumanResources.DepartmentTest (DepartmentID, Name, GroupName) VALUES (?, ?, ?)",
        row.DepartmentID, row.Name, row.GroupName
    )
cnxn.commit()
cursor.close()

Can anyone suggest how to go about it?

Upvotes: 0

Views: 2266

Answers (2)

John Rotenstein

Reputation: 269350

It appears that your goal is to load the contents of a CSV file from Amazon S3 into SQL Server.

You could do this without using Dataframes (a rough sketch follows the list below):

  • Loop through the Event Records (multiple can be passed in)
  • For each object:
    • Download the object to /tmp/
    • Use Python's csv reader to loop through the contents of the file
    • Generate INSERT statements to insert the data into the SQL Server table
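
A minimal sketch of that flow, reusing the placeholder connection details and the table/column names from your question (replace them with your own values):

import csv
import os
from urllib.parse import unquote_plus

import boto3
import pyodbc

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Placeholder connection string taken from the question
    cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=yourservername;'
                          'DATABASE=AdventureWorks;UID=username;PWD=yourpassword')
    cursor = cnxn.cursor()

    # Loop through the event records (multiple objects can be passed in)
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])

        # Download the object to /tmp/, the only writable path in Lambda
        local_path = os.path.join("/tmp", os.path.basename(key))
        s3.download_file(bucket, key, local_path)

        # Loop through the CSV and generate an INSERT per row
        with open(local_path, newline="") as f:
            for row in csv.DictReader(f):
                cursor.execute(
                    "INSERT INTO HumanResources.DepartmentTest (DepartmentID, Name, GroupName) VALUES (?, ?, ?)",
                    row["DepartmentID"], row["Name"], row["GroupName"]
                )

    cnxn.commit()
    cursor.close()
    return {"statusCode": 200, "body": "S3 file processed"}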

You might also consider using aws-data-wrangler: Pandas on AWS, which is available as a Lambda Layer.
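
For example, with that layer attached, reading the object into a dataframe is a one-liner (bucketname and csv_filename are the variables from your question):

import awswrangler as wr

# Read the CSV object from S3 straight into a pandas dataframe
df = wr.s3.read_csv(path=f"s3://{bucketname}/{csv_filename}")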

Upvotes: 1

Simon Hawe

Reputation: 4539

You can use io.BytesIO to get the bytes data into memory and, after that, use pandas.read_csv to transform it into a dataframe. Note that there is a strange SSL download limit that will lead to issues when downloading data larger than 2 GB. That is why I have used chunking in the code below.

import io
import pandas as pd

obj = s3.get_object(Bucket=bucketname, Key=csv_filename)
# This should prevent the 2GB download limit from a python ssl internal
chunks = (chunk for chunk in obj["Body"].iter_chunks(chunk_size=1024 ** 3))
data = io.BytesIO(b"".join(chunks))  # This keeps everything fully in memory
df = pd.read_csv(data)  # here you can also provide any necessary args and kwargs

Upvotes: 1
