Reputation: 4821
Pandas correctly errors out on rows that contain more fields than the header of a CSV file. However, it fills rows that contain fewer fields with NaN, even when there is no trailing , to indicate an empty field.
My csv:
id,name,pin,city
1,abc,123,SJ
2,xyz,789
3,pqr,456,AL
4,qwe,345,
When I try to read this via pandas:
>>> import pandas
>>> a = pandas.read_csv('test.csv', error_bad_lines=False)
>>> a
   id name  pin city
0   1  abc  123   SJ
1   2  xyz  789  NaN
2   3  pqr  456   AL
3   4  qwe  345  NaN
>>>
Here row 4 is read with NaN as its city value, which is correct because the trailing , indicates an empty field. But row 2 has only three fields, so it should raise an error or be left out of the DataFrame. Is there any way to achieve this?
Upvotes: 1
Views: 382
Reputation: 863711
You can preprocess the file to find the rows whose field count differs from the header's, then pass their line numbers to the skiprows parameter of read_csv:
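import csv

import numpy as np
import pandas as pd

# Collect the 0-based positions of data rows whose length differs from the header
out = []
with open('test.csv', 'r') as csvfile:
    reader = csv.reader(csvfile)
    head = next(reader)
    for i, row in enumerate(reader):
        if len(row) != len(head):
            out.append(i)

print(out)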
[1]
# Add 1 because skiprows counts file lines and line 0 is the header
df = pd.read_csv('test.csv', skiprows=np.array(out) + 1)
print(df)
   id name  pin city
0   1  abc  123   SJ
1   3  pqr  456   AL
2   4  qwe  345  NaN
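
The same check can be wrapped in a small helper. This is only a sketch: the name ragged_line_numbers is made up for illustration, and it assumes the file has no quoted fields containing embedded newlines, so that each parsed row corresponds to exactly one file line. It returns file line numbers directly, so no + 1 adjustment is needed afterwards.

```python
import csv
import io


def ragged_line_numbers(lines):
    """Return 0-based file line numbers of rows whose field count differs from the header.

    `lines` is any iterable of text lines, such as an open file object.
    (Hypothetical helper, not part of pandas or the csv module.)
    """
    reader = csv.reader(lines)
    header = next(reader)
    # Start counting at 1 because file line 0 is the header itself.
    return [n for n, row in enumerate(reader, start=1) if len(row) != len(header)]


# The sample data from the question, held in memory instead of test.csv.
sample = io.StringIO(
    "id,name,pin,city\n"
    "1,abc,123,SJ\n"
    "2,xyz,789\n"
    "3,pqr,456,AL\n"
    "4,qwe,345,\n"
)
print(ragged_line_numbers(sample))  # [2]
```

With a real file, the returned list can be passed straight to read_csv, for example pd.read_csv('test.csv', skiprows=bad) where bad is the helper's result for that file. The row ending in a trailing comma is kept, since the csv module counts the empty last field.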
Upvotes: 1