jimiclapton

Reputation: 889

Error using Pandas read_csv from S3 bucket in AWS lambda function - Expected 1 fields in line 5, saw 2

I am reading a CSV file from an S3 bucket using Pandas read_csv in an AWS Lambda function, and I keep seeing a tokenisation error relating to the contents of the CSV.

The first 5 lines are as follows (pasted from a text editor):

ItemID   |  NameID     | Users | Days | Pricing |     Expiration  | Status
-----------------------------------------------------------------------
370915293| aaaaqqq.abc |   0   |   0  |   $12   | 05/10/2021 11:44| Ran
371192969| aaacns.abc  |   7   |   0  |   $12   | 05/08/2021 09:34| Ran
370905229| aaamix.abc  |   0   |   0  |   $12   | 05/07/2021 10:32| Ran
371459366| aaapdf.abc  |  28   |   0  |   $12   | 05/11/2021 12:55| Ran

When I use the command:

rawdata = pd.read_csv(io.BytesIO(obj['Body'].read()), sep=',')

I see the following error:

Error tokenizing data. C error: Expected 1 fields in line 5, saw 2

Having explored the CSV file, it is not obvious to me why there is an issue with line 5.
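For context, the C parser raises this message when a row contains more delimiters than the preceding rows led it to expect. The sketch below reproduces the same kind of error on made-up in-memory data (the bytes are illustrative, not the real file): the first four lines contain no comma, so each is read as a single field, while the fifth contains one comma.

```python
import io

import pandas as pd

# Illustrative bytes only: pipe-delimited rows, with one stray comma on line 5.
sample = b"a|b\n1|2\n3|4\n5|6\n7,8|9\n"

try:
    pd.read_csv(io.BytesIO(sample), sep=",")
    message = ""
except pd.errors.ParserError as exc:
    # With sep=',' every earlier line is one field; line 5 splits into two.
    message = str(exc)

print(message)
```

If the real file behaves the same way, the line named in the error is the first one where the chosen separator appears more often than on the lines before it.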

Opening the file in any other environment (Jupyter notebook, PyCharm, etc.) gives no issues whatsoever. The issue seems to be specific to how the AWS Lambda function interprets this particular file.

I have also tried adding header=False and header=0 to force recognition of the 7 headers, but this does not alleviate the problem.

I have also tried specifying the parsing engine with engine='python', as per a previous suggestion, but this produced a different error:

pandas.errors.ParserError: ',' expected after '"'

Research has led me to understand that I can skip/ignore erroneous rows using skiprows=x but I do not wish to resort to this as I would like to understand and rectify the issue.

Is there anything else I can do to identify and isolate the issue?
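One way to isolate this kind of problem is to look at the raw bytes before pandas parses them. The following is a minimal sketch using only the standard library; describe_raw_csv is a hypothetical helper name, and the sample bytes stand in for whatever obj['Body'].read() returns in the Lambda function.

```python
def describe_raw_csv(raw: bytes, max_lines: int = 5, candidates: bytes = b",|;\t"):
    """Hypothetical helper: summarise line endings and per-line delimiter counts."""
    crlf = raw.count(b"\r\n")
    line_endings = {
        "crlf": crlf,
        "lf_only": raw.count(b"\n") - crlf,
        "cr_only": raw.count(b"\r") - crlf,
    }
    per_line = []
    # bytes.splitlines() splits on \n, \r and \r\n alike.
    for line in raw.splitlines()[:max_lines]:
        counts = {chr(c): line.count(bytes([c])) for c in candidates}
        per_line.append((line, counts))
    return line_endings, per_line


# Illustrative bytes only; in Lambda this would be the S3 object body.
raw = b"ItemID|NameID\r\n1|a\r\n2|b,c\r\n"
line_endings, per_line = describe_raw_csv(raw)

print(line_endings)
for line, counts in per_line:
    print(repr(line), counts)
```

Printing the repr of each line makes stray quotes, commas, or carriage returns visible, and the per-line counts show which candidate separator is consistent across rows.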

Thanks

Upvotes: 1

Views: 961

Answers (1)

Pawan Jain

Reputation: 825

I got this error a couple of times and solved it by passing lineterminator explicitly, like this. The default is None, in which case pandas works out the line ending itself. I think AWS changed the way the values are stored.

rawdata = pd.read_csv(io.BytesIO(obj['Body'].read()), sep=',', lineterminator='\n')
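To try the call outside Lambda, here is a small self-contained sketch. The bytes in body are made up and stand in for obj['Body'].read(); they use plain \n line endings, which is an assumption about the file rather than something confirmed in the question.

```python
import io

import pandas as pd

# Stand-in for obj['Body'].read(); illustrative data with \n line endings.
body = b"ItemID,Users\n370915293,0\n371192969,7\n"

# lineterminator must be a single character and is only used by the C parser.
rawdata = pd.read_csv(io.BytesIO(body), sep=",", lineterminator="\n")

print(rawdata)
```

If the real file turns out to use a different separator or line ending, the same call can be adjusted once the raw bytes have been checked.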

Upvotes: 1
