Ken eXs
Ken eXs

Reputation: 51

Error tokenizing data > How to removed extra \t in a row or line in a data(CSV)

First, I already run through several question and answers here about my problem (Error tokenizing data) but I haven't got any working answer because I think my issue is not the same with the one I read.

I have a tab separated data (saved as .xls file) and be able to read thru pandas read_csv but encountered error on specific line during parsing. The data consist of ~58K rows with blank row every other row.

Sample if opened in excel: enter image description here

When reading the file through read_csv:

df = pd.read_csv('8176.xls', header=None, sep='\t', encoding='cp1252')

And this give an error: ParserError: Error tokenizing data. C error: Expected 12 fields in line 13245, saw 13

I read several thread that this could be an header error or a separator error.

I was able to locate the row/line which is encountering error through running my read_csv with nrows and it was in line 6623 of the data.

df = pd.read_csv('8176.xls', header=None, sep='\t', encoding='cp1252', nrows=6623)

Through readlines, I think this is the one causing the error:

There is unwanted tab delimiter (\t) inserted on this line: enter image description here

The same issue when opening the file in excel, the data in that cell is splited and shifted by 1 column. enter image description here

But when opening the file in notepad, it is only a space character not a Tab. enter image description here

Is there any work around to eliminate the unwanted \t?

Or any encoding to read this as space only?

I already tried different encoding but also encountered errors.

Thank you so much.

Upvotes: 0

Views: 135

Answers (1)

sitting_duck
sitting_duck

Reputation: 3720

You can fix this programmatically during read_csv. Check this answer out:

Starting with pandas 1.4.0, read_csv() delivers capability that allows you to handle these situations in a more graceful and intelligent fashion by allowing a callable to be assigned to on_bad_lines=.

Pandas dataframe read_csv on bad data

Upvotes: 0

Related Questions