Kalenji

Reputation: 407

AWS Lambda - read csv and convert to pandas dataframe

I have a simple Lambda function that reads a csv file from an S3 bucket. Everything works fine until I try to load the csv data into a pandas DataFrame, at which point I get the error: string indices must be integers.

My code is bog-standard, but I need the csv as a DataFrame for further manipulation. The commented line below is the source of the error. I can print the data with no problems, so the bucket and file details are configured properly.

Updated code:

import json
import pandas as pd
import numpy as np
import requests
import glob
import time
import os
from datetime import datetime
from csv import reader
import boto3
import traceback
import io

s3_client = boto3.client('s3')

def lambda_handler(event, context):
    try:
        bucket_name = event["Records"][0]["s3"]["bucket"]["name"]
        s3_file_name = event["Records"][0]["s3"]["object"]["key"]
        resp = s3_client.get_object(Bucket=bucket_name, Key=s3_file_name)

        data = resp['Body'].read().decode('utf-8')
        df = pd.DataFrame(list(reader(data)))  # this line is the source of the error
        print(df.head())

    except Exception as err:
        print(err)
        traceback.print_exc()

    # TODO implement
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }
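
For reference: csv.reader expects an iterable of lines, so handing it the whole decoded string makes it iterate character by character, which is likely what mangles the DataFrame here. A minimal sketch of one possible fix, splitting the string into lines first (the csv content below is a stand-in for the real bucket data):

from csv import reader
import pandas as pd

data = "a,b,c\n1,2,3\n4,5,6"             # stand-in for resp['Body'].read().decode('utf-8')
rows = list(reader(data.splitlines()))    # splitlines() yields one string per csv row
df = pd.DataFrame(rows[1:], columns=rows[0])  # first row as header, assuming the file has one
print(df.head())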

Upvotes: 8

Views: 18782

Answers (3)

François B.

Reputation: 1174

import json
import pandas as pd
import boto3

s3_client = boto3.client('s3')

def lambda_handler(event, context):
    try:
        bucket_name = event["Records"][0]["s3"]["bucket"]["name"]
        s3_file_name = event["Records"][0]["s3"]["object"]["key"]
        obj = s3_client.get_object(Bucket=bucket_name, Key=s3_file_name)
        df = pd.read_csv(obj['Body'])  # obj['Body'] is a file-like StreamingBody that read_csv can parse
        print(df.head())

    except Exception as err:
        print(err)
        
    # TODO implement
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }
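
Note that read_csv accepts any file-like object, so the StreamingBody returned by get_object can be passed straight in; there is no need to read and decode it yourself first.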

Upvotes: 5

Iñigo González

Reputation: 3955

You can read the S3 file directly from pandas using read_csv:

import boto3
import pandas as pd

s3_client = boto3.client('s3')

def lambda_handler(event, context):
    try:
        bucket_name = event["Records"][0]["s3"]["bucket"]["name"]
        s3_file_name = event["Records"][0]["s3"]["object"]["key"]

        # This 'magic' needs s3fs (https://pypi.org/project/s3fs/)
        df = pd.read_csv(f's3://{bucket_name}/{s3_file_name}', sep=',')

        print(df.head())

    except Exception as err:
        print(err)

Things to remember:

   # Track memory usage at the cost of CPU. Great for troubleshooting. Use wisely.
   # (df.info prints directly and returns None, so don't wrap it in print().)
   df.info(verbose=True, memory_usage='deep')
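
One deployment caveat: s3fs is not part of the default Lambda runtime, so it must be bundled into the deployment package or a layer. Inside Lambda the execution role's credentials are normally picked up automatically; if you ever need to pass credentials explicitly, pandas 1.2+ forwards storage_options to s3fs. A sketch (bucket, key and credentials are illustrative placeholders):

import pandas as pd

# storage_options is forwarded to s3fs; hard-coding credentials like this
# is for illustration only - prefer the execution role in real code.
df = pd.read_csv(
    's3://my-bucket/my-file.csv',
    storage_options={'key': 'AKIA...', 'secret': '...'},
)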

Upvotes: 1

Life is complex

Reputation: 15619

I believe your problem is tied to this line in your function: df=pd.DataFrame( list(reader(data))). The code below should let you read the csv file into a pandas DataFrame for processing.

import boto3
import pandas as pd
from io import StringIO

s3_client = boto3.client('s3')

def lambda_handler(event, context):
    try:
        bucket_name = event["Records"][0]["s3"]["bucket"]["name"]
        s3_file_name = event["Records"][0]["s3"]["object"]["key"]
        resp = s3_client.get_object(Bucket=bucket_name, Key=s3_file_name)

        ###########################################
        # One of these methods should work for you.
        #
        # Method 1: pass the StreamingBody straight to read_csv
        df_s3_data = pd.read_csv(resp['Body'], sep=',')
        #
        # Method 2: decode to a string and wrap it in StringIO
        # df_s3_data = pd.read_csv(StringIO(resp['Body'].read().decode('utf-8')))
        ###########################################
        print(df_s3_data.head())

    except Exception as err:
        print(err)
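
The StringIO wrapper in Method 2 matters: the body has already been decoded to a str, and BytesIO only accepts bytes. If you keep the raw bytes and skip the decode, BytesIO(resp['Body'].read()) works just as well.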

Upvotes: 11
