Reputation: 2571
I'm processing lots (thousands) of ~100k-line CSV files that are produced by someone else. 9 times out of 10 the files have 8 columns and all is right with the world. The 10th time or so, ~10 lines will have 2 extra columns inserted after column 6. (For simplicity, let's assume every row holds the same values.)
A,B,C,D,E,F,G,H
A,B,C,D,E,F,G,H
A,B,C,D,E,F,Foo,Bar,G,H
A,B,C,D,E,F,G,H
A,B,C,D,E,F,Foo,Bar,G,H
A,B,C,D,E,F,G,H
A,B,C,D,E,F,G,H
I don't have control over the generation of the data files and need to clean them on my end, but I believe that rows with extra columns have corrupted data so I just want to reject them for now. I figured a simple way to handle this would be to initially load my data into a 10 column DataFrame:
In [100]: data_df = pd.read_csv(data_dir + data_file, names=ColumnNames)
In [101]: data_df
Out[101]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 99531 entries, 0 to 99530
Data columns:
time 99531 non-null values
var1 99531 non-null values
var2 99531 non-null values
var3 99531 non-null values
var4 99531 non-null values
var5 99531 non-null values
var6 98386 non-null values
var7 29829 non-null values
extra1 10 non-null values
extra2 10 non-null values
dtypes: float64(3), int64(5), object(2)
Then I keep only the rows where extra1 and extra2 are both null, and drop the two extra columns:
data_df = data_df[pd.isnull(data_df['extra1']) & pd.isnull(data_df['extra2'])]
del data_df['extra1']
del data_df['extra2']
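The load-then-filter steps above can be reproduced on the sample rows. This is only a minimal sketch: it assumes the ten column names from the DataFrame summary, reads from an in-memory string instead of a real file, and the helper name load_and_reject_wide_rows is made up for illustration.

```python
import io

import pandas as pd

# Sample rows from the question: two of them carry two extra fields.
raw = """A,B,C,D,E,F,G,H
A,B,C,D,E,F,G,H
A,B,C,D,E,F,Foo,Bar,G,H
A,B,C,D,E,F,G,H
A,B,C,D,E,F,Foo,Bar,G,H
A,B,C,D,E,F,G,H
A,B,C,D,E,F,G,H
"""

# Assumed names: the 8 real columns plus 2 placeholders for the overflow.
column_names = ["time", "var1", "var2", "var3", "var4", "var5",
                "var6", "var7", "extra1", "extra2"]


def load_and_reject_wide_rows(source, names):
    """Hypothetical helper: load with padded columns, drop overlong rows."""
    # Rows with only 8 fields are padded with NaN in extra1/extra2.
    df = pd.read_csv(source, names=names, header=None)
    # A normal row has nothing in either placeholder column.
    keep = df["extra1"].isnull() & df["extra2"].isnull()
    # Keep the good rows, then discard the placeholder columns.
    return df.loc[keep].drop(columns=["extra1", "extra2"])


clean = load_and_reject_wide_rows(io.StringIO(raw), column_names)
```

Note that the surviving rows keep their original index labels, so a reset_index(drop=True) may be wanted afterwards.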
This seems a little roundabout and non-ideal. Does anyone have a better idea of how to clean this?
Thanks
Upvotes: 4
Views: 2643
Reputation: 353009
If you want to drop the bad lines, you might be able to use error_bad_lines=False (and warn_bad_lines=False if you want it to be quiet about it):
>>> !cat unclean.csv
A,B,C,D,E,F,G,H
A,B,C,D,E,F,G,H
A,B,C,D,E,F,Foo,Bar,G,H
A,B,C,D,E,F,G,H
A,B,C,D,E,F,Foo,Bar,G,H
A,B,C,D,E,F,G,H
A,B,C,D,E,F,G,H
>>> df = pd.read_csv("unclean.csv", error_bad_lines=False, header=None)
Skipping line 3: expected 8 fields, saw 10
Skipping line 5: expected 8 fields, saw 10
>>> df
   0  1  2  3  4  5  6  7
0  A  B  C  D  E  F  G  H
1  A  B  C  D  E  F  G  H
2  A  B  C  D  E  F  G  H
3  A  B  C  D  E  F  G  H
4  A  B  C  D  E  F  G  H
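One caveat for newer installs: error_bad_lines and warn_bad_lines were deprecated in pandas 1.3 and removed in 2.0, with on_bad_lines taking over. A rough equivalent of the call above, assuming pandas 1.3 or later and reading the same sample from an in-memory string rather than unclean.csv, might look like this:

```python
import io

import pandas as pd

# Same sample as unclean.csv, held in memory so the sketch is self-contained.
raw = """A,B,C,D,E,F,G,H
A,B,C,D,E,F,G,H
A,B,C,D,E,F,Foo,Bar,G,H
A,B,C,D,E,F,G,H
A,B,C,D,E,F,Foo,Bar,G,H
A,B,C,D,E,F,G,H
A,B,C,D,E,F,G,H
"""

# on_bad_lines="skip" silently drops rows with too many fields;
# on_bad_lines="warn" would also report each skipped line.
df = pd.read_csv(io.StringIO(raw), header=None, on_bad_lines="skip")
```

As before, the expected field count comes from the first line, so this only helps when the first row is a well-formed one.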
Upvotes: 4