Reputation: 407
I have a simple Lambda function that reads a CSV file from an S3 bucket. Everything works fine until I try to load the CSV data into a pandas DataFrame, at which point I get the error string indices must be integers.
My code is bog-standard, but I need the CSV as a DataFrame for further manipulation. The line flagged with a comment below is the source of the error. I can print data with no problems, so the bucket and file details are configured properly.
Updated code:
import json
import pandas as pd
import numpy as np
import requests
import glob
import time
import os
from datetime import datetime
from csv import reader
import boto3
import traceback
import io

s3_client = boto3.client('s3')

def lambda_handler(event, context):
    try:
        bucket_name = event["Records"][0]["s3"]["bucket"]["name"]
        s3_file_name = event["Records"][0]["s3"]["object"]["key"]
        resp = s3_client.get_object(Bucket=bucket_name, Key=s3_file_name)
        data = resp['Body'].read().decode('utf-8')
        df = pd.DataFrame(list(reader(data)))  # <-- this line raises the error
        print(df.head())
    except Exception as err:
        print(err)
        traceback.print_exc()
    # TODO implement
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }
Upvotes: 8
Views: 18782
Reputation: 1174
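get_object returns the object's contents as a StreamingBody under the 'Body' key, and pandas.read_csv accepts any file-like object, so you can pass it straight through instead of decoding the bytes yourself: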
import json
import pandas as pd
import numpy as np
import requests
import glob
import time
import os
from datetime import datetime
from csv import reader
import boto3
import io

s3_client = boto3.client('s3')

def lambda_handler(event, context):
    try:
        bucket_name = event["Records"][0]["s3"]["bucket"]["name"]
        s3_file_name = event["Records"][0]["s3"]["object"]["key"]
        obj = s3_client.get_object(Bucket=bucket_name, Key=s3_file_name)
        df = pd.read_csv(obj['Body'])  # 'Body' is a StreamingBody, a file-like object
        print(df.head())
    except Exception as err:
        print(err)
    # TODO implement
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }
Upvotes: 5
Reputation: 3955
You can read the S3 file directly from pandas using read_csv:
import boto3
import pandas as pd

s3_client = boto3.client('s3')

def lambda_handler(event, context):
    try:
        bucket_name = event["Records"][0]["s3"]["bucket"]["name"]
        s3_file_name = event["Records"][0]["s3"]["object"]["key"]
        # This 'magic' needs s3fs (https://pypi.org/project/s3fs/)
        df = pd.read_csv(f's3://{bucket_name}/{s3_file_name}', sep=',')
        print(df.head())
    except Exception as err:
        print(err)
Pandas needs s3fs to read remote files - see [Reading Remote Files](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#reading-remote-files) in the pandas documentation.
You'll need to package the s3fs library with your Lambda - see AWS Lambda deployment package in Python.
If you're using this outside a Lambda (for testing), the tricky part is authentication.
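For local testing, one way to supply credentials explicitly (a minimal sketch, assuming pandas >= 1.2, which forwards storage_options to s3fs; the bucket, key, and credential values below are placeholders):

import pandas as pd

# storage_options is handed to s3fs; 'key' and 'secret' are the s3fs names
# for the AWS access key id and secret access key (placeholders below).
df = pd.read_csv(
    's3://my-bucket/my-file.csv',
    storage_options={'key': 'AKIA...', 'secret': '...'},
)
print(df.head())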
Since you're billed for CPU and memory usage, pandas' DataFrame.info() can help you assess a CSV's in-memory footprint and/or troubleshoot out-of-memory errors:
# Track memory usage at the cost of CPU. Great for troubleshooting. Use wisely.
# info() writes its report to stdout and returns None, so no print() is needed.
df.info(verbose=True, memory_usage='deep')
Upvotes: 1
Reputation: 15619
I believe your problem is likely tied to this line in your function: df = pd.DataFrame(list(reader(data))). The code below should let you read the CSV file into a pandas DataFrame for processing.
import boto3
import pandas as pd
from io import StringIO

s3_client = boto3.client('s3')

def lambda_handler(event, context):
    try:
        bucket_name = event["Records"][0]["s3"]["bucket"]["name"]
        s3_file_name = event["Records"][0]["s3"]["object"]["key"]
        resp = s3_client.get_object(Bucket=bucket_name, Key=s3_file_name)

        # Method 1: pass the StreamingBody straight to read_csv.
        df_s3_data = pd.read_csv(resp['Body'], sep=',')

        # Method 2: decode the bytes yourself and wrap the text in StringIO
        # (decoded text is a str, so it needs StringIO, not BytesIO).
        # df_s3_data = pd.read_csv(StringIO(resp['Body'].read().decode('utf-8')))

        print(df_s3_data.head())
    except Exception as err:
        print(err)
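If you'd rather keep csv.reader from the original code: iterating a plain string yields single characters, not lines, so the reader never sees whole rows. Wrapping the decoded text in StringIO (or splitting it with splitlines()) fixes that. A minimal sketch, with inline sample data standing in for the S3 payload:

import pandas as pd
from io import StringIO
from csv import reader

data = 'a,b\n1,2\n3,4\n'  # stands in for resp['Body'].read().decode('utf-8')
rows = list(reader(StringIO(data)))  # [['a', 'b'], ['1', '2'], ['3', '4']]
df = pd.DataFrame(rows[1:], columns=rows[0])  # first row as header
print(df)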
Upvotes: 11