Reputation: 43
I'm trying to read a unstructured csv file using pandas read_csv()
. The problem is some of the files have rows with extra columns as shown below in the sample input.
sample input:
col0,col1,col2
a,b,c
a,b,c,d
a,b,c
a,b,c,d
While handling these kind of files the program throws some ParseError
ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4
sample output :
col0 | col1 | col2 | col3
a | b | c | NaN
a | b | c | d
a | b | c | NaN
a | b | c | d
I don't want to ignore the lines with error_bad_lines = False
parameter in pandas read_csv()
.
Any kind of help will be highly appreciated.
Upvotes: 2
Views: 2408
Reputation: 863166
One possible solution is preprocessing first and find max number of separators, and set parameter names
by range
:
path_csv = 'file.csv'
with open(path_csv) as f:
lines = f.readlines()
num = max(l.count(',') for l in lines) + 1
print (num)
4
df = pd.read_csv(path_csv, names=range(num))
print (df)
0 1 2 3
0 col0 col1 col2 NaN
1 a b c NaN
2 a b c d
3 a b c NaN
4 a b c d
Similar if header is not important, so possible remove it:
df = pd.read_csv(path_csv, names=range(num), skiprows=1)
print (df)
0 1 2 3
0 a b c NaN
1 a b c d
2 a b c NaN
3 a b c d
Another more dynamic solution is add values to header:
path_csv = 'file.csv'
with open(path_csv) as f:
lines = f.readlines()
#get header to list
header = [x.strip() for x in lines[0].split(',')]
#get max number of separator
max_num = max(l.count(',') for l in lines)
#add missing header values by range
if len(header) < max_num + 1:
header = header + list(range(max_num-len(header) + 1))
print (header)
['col0', 'col1', 'col2', 0]
df = pd.read_csv(path_csv, names=header, skiprows=1)
print (df)
col0 col1 col2 0
0 a b c NaN
1 a b c d
2 a b c NaN
3 a b c d
Upvotes: 2