Harish Gopi
Harish Gopi

Reputation: 43

handling of unstructured data in pandas

I'm trying to read a unstructured csv file using pandas read_csv(). The problem is some of the files have rows with extra columns as shown below in the sample input.

sample input:

col0,col1,col2
a,b,c
a,b,c,d
a,b,c 
a,b,c,d

While handling these kind of files the program throws some ParseError

ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4

sample output :

col0 | col1 | col2 | col3
a    | b    | c    | NaN
a    | b    | c    | d
a    | b    | c    | NaN
a    | b    | c    | d

I don't want to ignore the lines with error_bad_lines = False parameter in pandas read_csv().

Any kind of help will be highly appreciated.

Upvotes: 2

Views: 2408

Answers (1)

jezrael
jezrael

Reputation: 863166

One possible solution is preprocessing first and find max number of separators, and set parameter names by range:

path_csv = 'file.csv'
with open(path_csv) as f:
    lines = f.readlines()
    num = max(l.count(',') for l in lines) + 1
    print (num)
    4

df = pd.read_csv(path_csv, names=range(num))
print (df)
      0     1     2    3
0  col0  col1  col2  NaN
1     a     b     c  NaN
2     a     b     c    d
3     a     b    c   NaN
4     a     b     c    d

Similar if header is not important, so possible remove it:

df = pd.read_csv(path_csv, names=range(num), skiprows=1)
print (df)
   0  1   2    3
0  a  b   c  NaN
1  a  b   c    d
2  a  b  c   NaN
3  a  b   c    d

Another more dynamic solution is add values to header:

path_csv = 'file.csv'
with open(path_csv) as f:
    lines = f.readlines()
    #get header to list
    header = [x.strip() for x in lines[0].split(',')]
    #get max number of separator
    max_num = max(l.count(',') for l in lines)
    #add missing header values by range
    if len(header) < max_num + 1:
        header = header + list(range(max_num-len(header) + 1))
    print (header)
['col0', 'col1', 'col2', 0]

df = pd.read_csv(path_csv, names=header, skiprows=1)
print (df)
  col0 col1 col2    0
0    a    b    c  NaN
1    a    b    c    d
2    a    b   c   NaN
3    a    b    c    d

Upvotes: 2

Related Questions