sunshooter

Reputation: 7

Python Log message into tokens

I have a log message in the format

[2013-Mar-05 18:21:45.415053] (ThreadID) <Module name> [Logging level]    Message Desciption : This is the message.

I want to create a dictionary of the form

{'time stamp': 2013-Mar-05 18:21:45.415053, 'ThreadId': 4139, 'Module name': ModuleA , 'Message Description': My Message, 'Message' : This is the message }

I tried splitting the log message on whitespace, so that I can then select the tokens I need and build the dictionary. Something like this:

for i in line1.split(" "):

This gives tokens like these:

['2013-Mar-05', '18:21:45.415053]', '(ThreadID)', '<Module name>', '[Logging level]',    'Message Desciption', ':', 'This is the message.']

I then pick the tokens I need and put them into the required dictionary.

Is there a better way to extract the tokens in this case? There is a pattern here: the time stamp is inside [], the thread ID is inside (), and the module name is inside <>. Can we leverage this to extract the tokens directly?

Upvotes: 1

Views: 1384

Answers (5)

msvalkon

Reputation: 12077

This is very similar to @Oli's answer, but the regex is a bit more readable, and I use groupdict() so there is no need to build a new dictionary by hand: the match object creates it from the named groups. The log string is parsed left to right, consuming each match.

import re

fmt = re.compile(
      r'\[(?P<timestamp>.+?)\]\s+'       # everything within [] goes to group timestamp
      r'\((?P<thread_id>.+?)\)\s+'       # everything within () goes to group thread_id
      r'\<(?P<module_name>.+?)\>\s+'     # everything within <> goes to group module_name
      r'\[(?P<log_level>.+?)\]\s+'       # everything within [] goes to group log_level
      r'(?P<message_desc>.+?)(\s:\s|$)'  # everything before \s:\s or end of line goes to group message_desc
      r'(?P<message>.+$)?'               # optional: if there was a \s:\s, everything after it goes to group message
      )

log = '[2013-Mar-05 18:21:45.415053] (4139) <ModuleA> [DEBUG]  Message Desciption : An example message!'

match = fmt.search(log)

print(match.groupdict())

Examples:

log = '[2013-Mar-05 18:21:45.415053] (4139) <ModuleA> [DEBUG]  Message Desciption : An example message!'
match = fmt.search(log)

print(match.groupdict())
{'log_level': 'DEBUG',
 'message': 'An example message!',
 'message_desc': 'Message Desciption',
 'module_name': 'ModuleA',
 'thread_id': '4139',
 'timestamp': '2013-Mar-05 18:21:45.415053'}

Example with your first test string from the comments of this answer:

log = '[2013-Mar-05 18:21:45.415053] (0x7aa5e3a0) <Logger> [Info] Opened settings file : /usr/local/ABC/ABC/var/loggingSettings.ini'

match = fmt.search(log)

print(match.groupdict())
{'log_level': 'Info',
 'message': '/usr/local/ABC/ABC/var/loggingSettings.ini',
 'message_desc': 'Opened settings file',
 'module_name': 'Logger',
 'thread_id': '0x7aa5e3a0',
 'timestamp': '2013-Mar-05 18:21:45.415053'}

Example with your second test string from the comments of this answer:

log = '[2013-Mar-05 18:21:45.415053] (0x7aa5e3a0) <Logger> [Info] Creating a new settings file'

match = fmt.search(log)

print(match.groupdict())
{'log_level': 'Info',
 'message': None,
 'message_desc': 'Creating a new settings file',
 'module_name': 'Logger',
 'thread_id': '0x7aa5e3a0',
 'timestamp': '2013-Mar-05 18:21:45.415053'}

EDIT: Fixed to work with OP's examples.
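
One thing to watch for: fmt.search() returns None when a line does not follow the format, so calling groupdict() directly on the result would raise an AttributeError. A minimal sketch of a guard, assuming you want None back for such lines (LOG_PATTERN and parse_line are names made up for this illustration, and the pattern is repeated only so the snippet runs on its own):

```python
import re

# Same pattern as in the answer, repeated so this sketch is self-contained.
# LOG_PATTERN and parse_line are hypothetical names, not part of any library.
LOG_PATTERN = re.compile(
    r'\[(?P<timestamp>.+?)\]\s+'
    r'\((?P<thread_id>.+?)\)\s+'
    r'<(?P<module_name>.+?)>\s+'
    r'\[(?P<log_level>.+?)\]\s+'
    r'(?P<message_desc>.+?)(?:\s:\s|$)'
    r'(?P<message>.+$)?'
)

def parse_line(line):
    """Return the named groups as a dict, or None if the line does not match."""
    match = LOG_PATTERN.search(line.strip())
    return match.groupdict() if match else None

lines = [
    '[2013-Mar-05 18:21:45.415053] (4139) <ModuleA> [DEBUG]  Message Desciption : An example message!',
    'not a log line',
]
parsed = [parse_line(line) for line in lines]
print(parsed)
```

Whether to skip, log, or raise on a non-matching line depends on your use case; returning None is just one option.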

Upvotes: 2

swappy

Reputation: 108

If you have a consistent log format, why not use named constants for the indexes?

Example

DATE = 0
TIME = 1
TID = 2
MODULE = 3
LOG_LVL = 4
MESSAGE = 7  # index 5 is the message description and index 6 is the ':'

log = ['2013-Mar-05', '18:21:45.415053]', '(ThreadID)', '<Module name>', '[Logging level]',    'Message Desciption', ':', 'This is the message.']

Then just access the tokens with log[DATE] and so on. If a field spans several tokens, stitch them together with " ".join before using it. From there you can populate your dictionary however you wish.

It's not as neat as Oli's solution but it can do the work :)
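
A rough sketch of that idea, assuming the thread ID, module name and logging level never contain spaces (otherwise the fixed indexes shift). The index names and the record dictionary are only illustrative; the sample line is one of the OP's test strings quoted elsewhere on this page:

```python
# Hypothetical index names for the tokens produced by splitting on whitespace.
# REST marks where the free-form part (description and message) begins.
DATE, TIME, TID, MODULE, LOG_LVL, REST = range(6)

line = ('[2013-Mar-05 18:21:45.415053] (0x7aa5e3a0) <Logger> [Info] '
        'Opened settings file : /usr/local/ABC/ABC/var/loggingSettings.ini')

tokens = line.split()

# The time stamp spans two tokens, so join them before stripping the brackets.
timestamp = ' '.join(tokens[DATE:TIME + 1]).strip('[]')

# Everything after the logging level is "description : message".
rest = ' '.join(tokens[REST:])
description, _, message = rest.partition(' : ')

record = {
    'time stamp': timestamp,
    'ThreadId': tokens[TID].strip('()'),
    'Module name': tokens[MODULE].strip('<>'),
    'Logging level': tokens[LOG_LVL].strip('[]'),
    'Message Description': description,
    'Message': message,
}
print(record)
```

If there is no " : " separator in the line, partition() leaves message as an empty string and the whole remainder ends up in the description.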

Upvotes: 0

pradyunsg

Reputation: 19406

Although using re is simpler in this case, if you don't want to use it, try this:

string = '[2013-Mar-05 18:21:45.415053] (ThreadID) <Module name> [Logging level]    Message Desciption : This is the message.'

# the main function, return the items between start and end.
def get_between(start, end, string):
    in_between = 0
    c_str = ''
    items = []
    indexes = []
    for i in range(len(string)):
        char = string[i]
        if char == start:
            if in_between == 0: indexes.append(i) # if starting bracket
            in_between += 1
        elif char == end:
            in_between -= 1
            if in_between == 0: indexes.append(i) # if ending bracket
        elif in_between > 0:
            c_str += char
        if in_between == 0 and c_str != '': # after ending bracket
            items.append(c_str)
            c_str = ''
    return items, indexes

# As both Time Stamp, and Logging Level are between []s,
# And as message comes after Logging Level,
data,last_indexes = get_between('[',']',string)
time_stamp, logging = data
# We only want the first item in the first list
thread_id = get_between('(',')',string)[0][0]
module = get_between('<','>',string)[0][0]

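last = max(last_indexes)
# Extract the message: everything after the first ':' that follows the last ']'.
# partition() keeps any further ':' characters inside the message itself.
message = string[last+1:].partition(':')[2].strip()

mydict = {'Time':time_stamp, 'Thread ID':thread_id,'Module':module,'Logging Level':logging,'Message':message}
print(mydict)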

Here, we collect the characters between each pair of delimiters (the "classifiers") and work with them.

Upvotes: 0

Dhara

Reputation: 6767

How about the following? (The comments explain what's going on)

log = '[2013-Mar-05 18:21:45.415053] (ThreadID) <Module name> [Logging level]    Message Description : This is the message.'

# Define functions on how to process the different kinds of tokens
time_stamp = logging_level = lambda x: x.strip('[ ]')
thread_ID = lambda x: x.strip('( )')
module_name = lambda x: x.strip('< >')
message_description = message = lambda x: x.strip()  # drop surrounding whitespace

# Names of the tokens used to make the dictionary keys
keys = ['time stamp', 'ThreadId',
        'Module name', 'Logging level',
        'Message Description', 'Message']
# Define functions on how to process the message
funcs = [time_stamp, thread_ID,
         module_name, logging_level,
         message_description, message]
# Define the tokens at which to split the message
split_on = [']', ')', '>', ']', ':']

msg_dict = {}

for i in range(len(split_on)):
    # Split up the log one token at a time
    temp, log = log.split(split_on[i], 1)
    # Process the token using the defined function
    msg_dict[keys[i]] = funcs[i](temp) 

# Process the last token: whatever remains after the final split is the message
msg_dict[keys[-1]] = funcs[-1](log)
print(msg_dict)

Upvotes: 0

Oli

Reputation: 2452

Using a regular expression, hope this helps!

import re

string = '[2013-Mar-05 18:21:45.415053] (4444) <Module name> [Logging level]  Message Desciption : This is the message.'

regex = re.compile(r'\[(?P<timestamp>[^\]]*?)\] \((?P<threadid>[^\)]*?)\) \<(?P<modulename>[^\>]*?)\>[^:]*?\:(?P<message>.*?)$')

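# A single log line yields at most one match, so search() is enough
match = regex.search(string)
# Named "result" rather than "dict" to avoid shadowing the built-in type
result = {'timestamp': match.group("timestamp"), 'threadid': match.group("threadid"), 'modulename': match.group('modulename'), 'message': match.group('message')}

print(result)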

output:

{'timestamp': '2013-Mar-05 18:21:45.415053', 'message': ' This is the message.', 'modulename': 'Module name', 'threadid': '4444'}

Explanation: I'm using named groups to mark the parts of the regex I need later in the script. See http://docs.python.org/2/library/re.html for more info. Basically I go through the line from left to right, looking for the delimiters [, ( and <.

Upvotes: 1
