Dangraf
Dangraf

Reputation: 99

python,Read CSV with too many separators

I'm trying to read a log file using pandas in python with the following format:

TimeStamp, Label, VarName, Data
2017-03-09 13:03:20, USER_INPUT, Name, Test 1
2017-03-09 13:03:20, USER_INPUT, ProjectNumber, 04BGA
2017-03-09 13:03:20, USER_INPUT, PasteType, Senju F100
.
.

which works fine until i reach the following type of line:

2017-03-09 12:03:20, CASETTE_INFO, CassetteHSPGetEjectorFeedAdjust, 2,2,2,2.4,3.5,1.15,1.1

Is it possible to tell pandas somehow that if there are more than 3 separators, then squeeze the rest together as a string? like

data = "2,2,2,2.4,3.5,1.15,1.1" 

Or is there any other solutions?

I'm just skipping these lines at the moment to go around the problem

pd.read_csv(filename, error_bad_lines=False)) 

It would be nice to solve this type of problem in my log-file

Upvotes: 1

Views: 233

Answers (1)

jezrael
jezrael

Reputation: 863531

I think you need change separator to ,\s+ - it means comma with one or more whitespaces:

df = pd.read_csv(filename, sep=",\s+", engine='python')

Sample:

import pandas as pd
from pandas.compat import StringIO

temp=u"""imeStamp, Label, VarName, Data
2017-03-09 13:03:20, USER_INPUT, Name, Test 1
2017-03-09 13:03:20, USER_INPUT, ProjectNumber, 04BGA
2017-03-09 13:03:20, USER_INPUT, PasteType, Senju F100
2017-03-09 12:03:20, CASETTE_INFO, CassetteHSPGetEjectorFeedAdjust, 2,2,2,2.4,3.5,1.15,1.1"""
#after testing replace 'StringIO(temp)' to 'filename.csv'
df = pd.read_csv(StringIO(temp), sep=",\s+", engine='python')
print (df)
              imeStamp         Label                          VarName  \
0  2017-03-09 13:03:20    USER_INPUT                             Name   
1  2017-03-09 13:03:20    USER_INPUT                    ProjectNumber   
2  2017-03-09 13:03:20    USER_INPUT                        PasteType   
3  2017-03-09 12:03:20  CASETTE_INFO  CassetteHSPGetEjectorFeedAdjust   

                     Data  
0                  Test 1  
1                   04BGA  
2              Senju F100  
3  2,2,2,2.4,3.5,1.15,1.1  

If 1.solution does not work use separator which is not in data - ¥ or | and read all data to one column data. Then use str.split:

import pandas as pd
from pandas.compat import StringIO

temp=u"""TimeStamp, Label, VarName, Data
2017-03-09 13:03:20, USER_INPUT, Name, Test 1
2017-03-09 13:03:20, USER_INPUT, ProjectNumber, 04BGA
2017-03-09 13:03:20, USER_INPUT, PasteType, Senju F100
2017-03-09 12:03:20, CASETTE_INFO, CassetteHSPGetEjectorFeedAdjust, 2,2,2,2.4,3.5,1.15,1.1"""
#after testing replace 'StringIO(temp)' to 'filename.csv'
df = pd.read_csv(StringIO(temp), sep = '¥', names=['data'], skiprows=1)
print (df)
                                                data
0      2017-03-09 13:03:20, USER_INPUT, Name, Test 1
1  2017-03-09 13:03:20, USER_INPUT, ProjectNumber...
2  2017-03-09 13:03:20, USER_INPUT, PasteType, Se...
3  2017-03-09 12:03:20, CASETTE_INFO, CassetteHSP...

df = df.data.str.split(',', n=3, expand=True)
df.columns = ['TimeStamp','Label','VarName','Data']
print (df)
             TimeStamp          Label                           VarName  \
0  2017-03-09 13:03:20     USER_INPUT                              Name   
1  2017-03-09 13:03:20     USER_INPUT                     ProjectNumber   
2  2017-03-09 13:03:20     USER_INPUT                         PasteType   
3  2017-03-09 12:03:20   CASETTE_INFO   CassetteHSPGetEjectorFeedAdjust   

                      Data  
0                   Test 1  
1                    04BGA  
2               Senju F100  
3   2,2,2,2.4,3.5,1.15,1.1  

Upvotes: 3

Related Questions