ukbaz

Reputation: 547

Log file to Pandas Dataframe

I have log files with many lines of the form:

LogLevel    [13/10/2015 00:30:00.650]  [Message Text]

My goal is to convert each line of the log file into a row of a DataFrame. I have tried to do that by splitting the lines on the [ character, but I am still not getting a neat DataFrame.

My code:

level = []
time = []
text = []

with open(filename) as inf:
    for line in inf:
        parts = line.split('[')
        if len(parts) > 1:
            level = parts[0]
            time = parts[1]
            text = parts[2]
        print(parts[0], parts[1], parts[2])

s1 = pd.Series({'Level': level, 'Time': time, 'Text': text})
df = pd.DataFrame(s1).reset_index()

Here's my printed DataFrame:

Info      10/08/16 10:56:09.843]   In Function CCatalinaPrinter::ItemDescription()]

Info      10/08/16 10:56:09.843]   Sending UPC Description Message ]

How can I improve this to strip the whitespace and the trailing ']' characters?

Thank you

Upvotes: 10

Views: 28540

Answers (2)

jxramos

Reputation: 8266

I had to parse mine manually, since my separator showed up in the message body and the message body could span multiple lines as well, e.g. if an exception were thrown from my Flask application and the stack trace was recorded.

Here's my log creation format...

logging.basicConfig(
    filename="%s/%s_MyApp.log" % (Utilities.logFolder, datetime.datetime.today().strftime("%Y%m%d-%H%M%S")),
    level=logging.DEBUG,
    format="%(asctime)s,%(name)s,%(process)s,%(levelno)u,%(message)s",
    datefmt="%Y-%m-%d %H:%M:%S"
)
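For reference, a hedged sketch of the kind of record that format produces (the logger name, process id and timestamp below are made-up examples, not actual output):

import logging

logger = logging.getLogger("MyApp.views")   # hypothetical logger name
logger.info("Sending UPC Description Message")
# With the format string above, the file gets a comma-separated line like:
# 2015-10-13 00:30:00,MyApp.views,1234,20,Sending UPC Description Message
# i.e. asctime, name, process id, levelno (20 = INFO), message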

And the parsing code in my Utilities module

Utilities.py

import re
import pandas

logFolder = "./Logs"

logLevelToString = { "50" : "CRITICAL",
                     "40" : "ERROR"   ,
                     "30" : "WARNING" ,
                     "20" : "INFO"    ,
                     "10" : "DEBUG"   ,
                     "0"  : "NOTSET"  } # https://docs.python.org/3.6/library/logging.html#logging-levels

def logFile2DataFrame( filePath ) :
    dfLog = pandas.DataFrame( columns=[ 'Timestamp' , 'Module' , 'ProcessID' , 'Level' , 'Message' ] )
    tsPattern = "^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2},"

    with open( filePath , 'r' ) as logFile :
        numRows = -1
        for line in logFile :
            if re.search( tsPattern , line ) :
                tokens    = line.split(",")
                timestamp = tokens[0]
                module    = tokens[1]
                processID = tokens[2]
                level     = logLevelToString[ tokens[3] ]
                message   = ",".join( tokens[4:] )
                numRows += 1
                dfLog.loc[ numRows ] = [ timestamp , module , processID , level , message ]
            else :
                # Multiline message, integrate it into last record
                dfLog.loc[ numRows , 'Message' ] += line
    return dfLog
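A minimal usage sketch, assuming the Utilities module layout above (the log path is a made-up example):

import Utilities

# hypothetical path to a file written by the basicConfig call above
dfLog = Utilities.logFile2DataFrame("./Logs/20151013-003000_MyApp.log")
print(dfLog[['Timestamp', 'Level', 'Message']].head())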

I actually created this helper method to let me view my logs directly from my Flask app, as I have a handy template that renders a DataFrame. It should accelerate debugging a bunch, since wrapping the Flask app in a Tornado WSGI server prevents the normal Flask debug page from being displayed when an exception gets thrown. If anyone knows how to restore that functionality in such a setup, please share.

Upvotes: 3

jezrael

Reputation: 862601

You can use read_csv with the separator \s*\[, i.e. whitespace followed by [:

import pandas as pd
from io import StringIO

temp=u"""LogLevel    [13/10/2015 00:30:00.650]  [Message Text]
LogLevel    [13/10/2015 00:30:00.650]  [Message Text]
LogLevel    [13/10/2015 00:30:00.650]  [Message Text]
LogLevel    [13/10/2015 00:30:00.650]  [Message Text]"""
#after testing replace StringIO(temp) with filename
df = pd.read_csv(StringIO(temp), sep=r"\s*\[", names=['Level','Time','Text'], engine='python')

Then remove the trailing ] with str.strip and convert the Time column with to_datetime:

df.Time = pd.to_datetime(df.Time.str.strip(']'), format='%d/%m/%Y %H:%M:%S.%f')
df.Text = df.Text.str.strip(']')

print (df)
      Level                    Time          Text
0  LogLevel 2015-10-13 00:30:00.650  Message Text
1  LogLevel 2015-10-13 00:30:00.650  Message Text
2  LogLevel 2015-10-13 00:30:00.650  Message Text
3  LogLevel 2015-10-13 00:30:00.650  Message Text

print (df.dtypes)
Level            object
Time     datetime64[ns]
Text             object
dtype: object
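After testing, the same call works directly against the log file instead of the StringIO sample (logs.txt below is a hypothetical filename):

df = pd.read_csv('logs.txt', sep=r"\s*\[", names=['Level','Time','Text'], engine='python')
df.Time = pd.to_datetime(df.Time.str.strip(']'), format='%d/%m/%Y %H:%M:%S.%f')
df.Text = df.Text.str.strip(']')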

Upvotes: 13
