Reputation: 547
I have log files with many lines of the form:
LogLevel [13/10/2015 00:30:00.650] [Message Text]
My goal is to convert each line of the log file into a nice DataFrame. I have tried to do that by splitting the lines on the [ character, but I am still not getting a neat DataFrame.
My code:
level = []
time = []
text = []
with open(filename) as inf:
    for line in inf:
        parts = line.split('[')
        if len(parts) > 1:
            level = parts[0]
            time = parts[1]
            text = parts[2]
            print(parts[0], parts[1], parts[2])
s1 = pd.Series({'Level': level, 'Time': time, 'Text': text})
df = pd.DataFrame(s1).reset_index()
Here's my printed output:
Info 10/08/16 10:56:09.843] In Function CCatalinaPrinter::ItemDescription()]
Info 10/08/16 10:56:09.843] Sending UPC Description Message ]
How can I improve this to strip the whitespace and the other ']' character?
Thank you
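For reference, a minimal sketch of one way to fix the loop: append to the lists rather than overwriting them on every iteration, and strip the stray brackets (the sample line here is taken from the printed output above):

```python
import pandas as pd

levels, times, texts = [], [], []
sample_lines = [
    "Info [10/08/16 10:56:09.843] [In Function CCatalinaPrinter::ItemDescription()]",
]
for line in sample_lines:
    parts = line.split('[')
    if len(parts) > 2:
        levels.append(parts[0].strip())             # log level before the first '['
        times.append(parts[1].strip().rstrip(']'))  # timestamp with trailing ']' removed
        texts.append(parts[2].strip().rstrip(']'))  # message text with trailing ']' removed

df = pd.DataFrame({'Level': levels, 'Time': times, 'Text': texts})
```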
Upvotes: 10
Views: 28540
Reputation: 8266
I had to parse mine manually, since my separator showed up in my message body and the message body could span multiple lines as well, e.g. if an exception were thrown from my Flask application and the stack trace was recorded.
Here's my log creation format...
logging.basicConfig(
    filename="%s/%s_MyApp.log" % ( Utilities.logFolder ,
                                   datetime.datetime.today().strftime("%Y%m%d-%H%M%S") ) ,
    level=logging.DEBUG ,
    format="%(asctime)s,%(name)s,%(process)s,%(levelno)u,%(message)s" ,
    datefmt="%Y-%m-%d %H:%M:%S" )
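As a sketch of what that format string produces (the logger name and message below are made up for illustration), each record starts with a comma-separated header, which is what the parsing code later keys on:

```python
import io
import logging

# Log to an in-memory stream using the same format string as above
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(asctime)s,%(name)s,%(process)s,%(levelno)u,%(message)s",
    datefmt="%Y-%m-%d %H:%M:%S"))
logger = logging.getLogger("demo")   # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)

logger.info("Sending UPC Description Message")
# stream now holds something like:
# 2023-01-01 12:00:00,demo,1234,20,Sending UPC Description Message
```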
And the parsing code in my Utilities module
Utilities.py
import re
import pandas
logFolder = "./Logs"
logLevelToString = { "50" : "CRITICAL" ,
                     "40" : "ERROR"    ,
                     "30" : "WARNING"  ,
                     "20" : "INFO"     ,
                     "10" : "DEBUG"    ,
                     "0"  : "NOTSET"   }  # https://docs.python.org/3.6/library/logging.html#logging-levels
def logFile2DataFrame( filePath ) :
    dfLog = pandas.DataFrame( columns=[ 'Timestamp' , 'Module' , 'ProcessID' , 'Level' , 'Message' ] )
    tsPattern = "^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},"
    with open( filePath , 'r' ) as logFile :
        numRows = -1
        for line in logFile :
            if re.search( tsPattern , line ) :
                tokens = line.split(",")
                timestamp = tokens[0]
                module = tokens[1]
                processID = tokens[2]
                level = logLevelToString[ tokens[3] ]
                message = ",".join( tokens[4:] )
                numRows += 1
                dfLog.loc[ numRows ] = [ timestamp , module , processID , level , message ]
            else :
                # Multi-line message: append it to the last record
                dfLog.loc[ numRows , 'Message' ] += line
    return dfLog
I actually created this helper method to let me view my logs directly from my Flask app, as I have a handy template that renders a DataFrame. It should accelerate debugging a bunch, since wrapping the Flask app in a Tornado WSGI server prevents the normal debug page that Flask displays when an exception is thrown. If anyone knows how to restore that functionality in such a setup, please share.
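As an aside (not part of the original answer): assigning dfLog.loc[numRows] = ... row by row gets slow for large logs, because each assignment may reallocate the frame. A common alternative is to collect plain dicts and build the DataFrame once at the end; a minimal sketch of that pattern with the same columns:

```python
import pandas as pd

records = []  # one dict per parsed log record
records.append({"Timestamp": "2021-01-01 10:00:00", "Module": "app",
                "ProcessID": "42", "Level": "ERROR", "Message": "boom"})

# A continuation line extends the most recent record's message
records[-1]["Message"] += "\nTraceback (most recent call last): ..."

dfLog = pd.DataFrame(records,
                     columns=["Timestamp", "Module", "ProcessID", "Level", "Message"])
```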
Upvotes: 3
Reputation: 862601
You can use read_csv with the separator \s*\[ (optional whitespace followed by [):
import pandas as pd
from io import StringIO  # pandas.compat.StringIO was removed in newer pandas versions
temp=u"""LogLevel [13/10/2015 00:30:00.650] [Message Text]
LogLevel [13/10/2015 00:30:00.650] [Message Text]
LogLevel [13/10/2015 00:30:00.650] [Message Text]
LogLevel [13/10/2015 00:30:00.650] [Message Text]"""
#after testing, replace StringIO(temp) with the filename
df = pd.read_csv(StringIO(temp), sep=r"\s*\[", names=['Level','Time','Text'], engine='python')
Then remove the trailing ] with str.strip and convert the Time column with to_datetime:
df.Time = pd.to_datetime(df.Time.str.strip(']'), format='%d/%m/%Y %H:%M:%S.%f')
df.Text = df.Text.str.strip(']')
print (df)
Level Time Text
0 LogLevel 2015-10-13 00:30:00.650 Message Text
1 LogLevel 2015-10-13 00:30:00.650 Message Text
2 LogLevel 2015-10-13 00:30:00.650 Message Text
3 LogLevel 2015-10-13 00:30:00.650 Message Text
print (df.dtypes)
Level object
Time datetime64[ns]
Text object
dtype: object
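An alternative sketch (not from the original answer): parse all three fields with a single regular expression via Series.str.extract with named groups, which avoids the separate strip calls entirely:

```python
import pandas as pd

lines = pd.Series(["LogLevel [13/10/2015 00:30:00.650] [Message Text]"])

# Named groups become DataFrame columns; [^\]]+ matches everything up to ']'
pattern = r'^(?P<Level>\S+)\s+\[(?P<Time>[^\]]+)\]\s+\[(?P<Text>[^\]]+)\]$'
df = lines.str.extract(pattern)
df['Time'] = pd.to_datetime(df['Time'], format='%d/%m/%Y %H:%M:%S.%f')
```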
Upvotes: 13