Reputation: 99
I'm trying to read a log file using pandas in python with the following format:
TimeStamp, Label, VarName, Data
2017-03-09 13:03:20, USER_INPUT, Name, Test 1
2017-03-09 13:03:20, USER_INPUT, ProjectNumber, 04BGA
2017-03-09 13:03:20, USER_INPUT, PasteType, Senju F100
.
.
which works fine until i reach the following type of line:
2017-03-09 12:03:20, CASETTE_INFO, CassetteHSPGetEjectorFeedAdjust, 2,2,2,2.4,3.5,1.15,1.1
Is it possible to tell pandas somehow that if there are more than 3 separators, then squeeze the rest together as a string? like
data = "2,2,2,2.4,3.5,1.15,1.1"
Or is there any other solutions?
I'm just skipping these lines at the moment to go around the problem
pd.read_csv(filename, error_bad_lines=False))
It would be nice to solve this type of problem in my log-file
Upvotes: 1
Views: 233
Reputation: 863531
I think you need change separator to ,\s+
- it means comma with one or more whitespaces:
df = pd.read_csv(filename, sep=",\s+", engine='python')
Sample:
import pandas as pd
from pandas.compat import StringIO
temp=u"""imeStamp, Label, VarName, Data
2017-03-09 13:03:20, USER_INPUT, Name, Test 1
2017-03-09 13:03:20, USER_INPUT, ProjectNumber, 04BGA
2017-03-09 13:03:20, USER_INPUT, PasteType, Senju F100
2017-03-09 12:03:20, CASETTE_INFO, CassetteHSPGetEjectorFeedAdjust, 2,2,2,2.4,3.5,1.15,1.1"""
#after testing replace 'StringIO(temp)' to 'filename.csv'
df = pd.read_csv(StringIO(temp), sep=",\s+", engine='python')
print (df)
imeStamp Label VarName \
0 2017-03-09 13:03:20 USER_INPUT Name
1 2017-03-09 13:03:20 USER_INPUT ProjectNumber
2 2017-03-09 13:03:20 USER_INPUT PasteType
3 2017-03-09 12:03:20 CASETTE_INFO CassetteHSPGetEjectorFeedAdjust
Data
0 Test 1
1 04BGA
2 Senju F100
3 2,2,2,2.4,3.5,1.15,1.1
If 1.solution does not work use separator which is not in data - ¥
or |
and read all data to one column data
. Then use str.split
:
import pandas as pd
from pandas.compat import StringIO
temp=u"""TimeStamp, Label, VarName, Data
2017-03-09 13:03:20, USER_INPUT, Name, Test 1
2017-03-09 13:03:20, USER_INPUT, ProjectNumber, 04BGA
2017-03-09 13:03:20, USER_INPUT, PasteType, Senju F100
2017-03-09 12:03:20, CASETTE_INFO, CassetteHSPGetEjectorFeedAdjust, 2,2,2,2.4,3.5,1.15,1.1"""
#after testing replace 'StringIO(temp)' to 'filename.csv'
df = pd.read_csv(StringIO(temp), sep = '¥', names=['data'], skiprows=1)
print (df)
data
0 2017-03-09 13:03:20, USER_INPUT, Name, Test 1
1 2017-03-09 13:03:20, USER_INPUT, ProjectNumber...
2 2017-03-09 13:03:20, USER_INPUT, PasteType, Se...
3 2017-03-09 12:03:20, CASETTE_INFO, CassetteHSP...
df = df.data.str.split(',', n=3, expand=True)
df.columns = ['TimeStamp','Label','VarName','Data']
print (df)
TimeStamp Label VarName \
0 2017-03-09 13:03:20 USER_INPUT Name
1 2017-03-09 13:03:20 USER_INPUT ProjectNumber
2 2017-03-09 13:03:20 USER_INPUT PasteType
3 2017-03-09 12:03:20 CASETTE_INFO CassetteHSPGetEjectorFeedAdjust
Data
0 Test 1
1 04BGA
2 Senju F100
3 2,2,2,2.4,3.5,1.15,1.1
Upvotes: 3