Junior

Reputation: 69

Python query: iterating through log file

Can someone please help me with the following? I have a log file with thousands of lines like these:

    jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 recv: 1 timestamp: 00:00:02,217
    jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 ack: 13 timestamp: 00:00:04,537
    jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 recv: 1 timestamp: 00:00:08,018
    jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 nack: 14 timestamp: 00:00:10,338

I would like to write a Python script that iterates through this file and, for each jarid (the second field on each line), collects the timestamp from every line where that jarid appears and prints them all on one line. For example, for the following two lines:

    jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 recv: 1 timestamp: 00:00:02,217 
    jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 ack: 13 timestamp: 00:00:04,537

I would like to get the following output:

    jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 recv: 00:00:02,217 ack: 00:00:04,537

I think the best way to accomplish this is with a dictionary (or maybe not; please comment). I have written the following script, which partly works but does not give me the desired output:

    #!/opt/SP/bin/python

    log = file("/opt/SP/logs/generic.log", "r")
    filecontent = log.xreadlines()
    storage = {}
    for line in filecontent:
        line = line.strip()
        jarid, JARID, status, STATUS, timestamp, TIME = line.split(" ")
        if JARID not in storage:
            storage[JARID] = {}
        if STATUS not in storage[JARID]:
            storage[JARID][STATUS] = {}
        if TIME not in storage[JARID][STATUS]:
            storage[JARID][STATUS][TIME] = {}

    jarids = storage.keys()
    jarids.sort()
    for JARID in jarids:
        stats = storage[JARID].keys()
        stats.sort()
        for STATUS in stats:
            times = storage[JARID][STATUS].keys()
            times.sort()
            for TIME in times:
                all = storage[JARID][STATUS][TIME].keys()
                all.sort()

    for JARID in jarids:
        if "1" in storage[JARID].keys() and "13" in storage[JARID].keys():
            print "MSG: %s, RECV: %s, ACK: %s" % (JARID, storage[JARID]["1"], storage[JARID]["13"])
        else:
            if "1" in storage[JARID].keys() and "14" in storage[JARID].keys():
                print "MSG: %s, RECV: %s, NACK: %s" % (JARID, storage[JARID]["1"], storage[JARID]["14"])

When I run this script, I get the following output:

    MSG: 7e5ae720-9151-11e0-eff2-00238bce4216, RECV: {'00:00:02,217': {}}, ACK: {'00:00:04,537': {}}

Please note that I am still learning Python and my scripting skills are not all that great!

Can you please help me figure out how to get the desired output shown above?

Upvotes: 2

Views: 2964

Answers (5)

JBernardo

Reputation: 33397

This should work (updated).

Using:

log = ['jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 recv: 1 timestamp: 00:00:02,217',
       'jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 ack: 13 timestamp: 00:00:04,537',
       'jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 recv: 1 timestamp: 00:00:08,018',
       'jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 nack: 14 timestamp: 00:00:10,338']

you can do:

d = {}
for i in (line.split() for line in log):
    d.setdefault(i[1], {}).update({i[2]:i[-1]})

# As pointed out by @gnibbler, you can also use "defaultdict"
# instead of dict with "setdefault".

then you may print it with:

for i,j in d.items():
    print 'jarid:', i,
    for k,m in j.items():
        print k, m,
    print
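
For anyone on Python 3, where `print` is a function, the same `setdefault` idea could look roughly like the sketch below. The helper names `group_by_jarid` and `format_entry` are made up for illustration, and the output order relies on dictionaries keeping insertion order (Python 3.7+).

```python
log = [
    "jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 recv: 1 timestamp: 00:00:02,217",
    "jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 ack: 13 timestamp: 00:00:04,537",
    "jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 recv: 1 timestamp: 00:00:08,018",
    "jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 nack: 14 timestamp: 00:00:10,338",
]


def group_by_jarid(lines):
    """Map each jarid to {label: timestamp}, e.g. {'recv:': '00:00:02,217'}."""
    grouped = {}
    for fields in (line.split() for line in lines):
        # fields[1] is the jarid, fields[2] the label ('recv:', 'ack:' or
        # 'nack:', colon included), fields[-1] the timestamp.
        grouped.setdefault(fields[1], {})[fields[2]] = fields[-1]
    return grouped


def format_entry(jarid, events):
    """Build one output line: the jarid followed by each label and timestamp."""
    parts = ["jarid: " + jarid]
    parts.extend(label + " " + ts for label, ts in events.items())
    return " ".join(parts)


for jarid, events in group_by_jarid(log).items():
    print(format_entry(jarid, events))
```

Building the line with `" ".join(...)` stands in for the trailing-comma `print` trick, which no longer exists in Python 3.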

Upvotes: 0

Rob Cowie

Reputation: 22619

This solution is somewhat similar to @JBernardo's, though I chose to parse the lines with a regular expression. I've written it now, so I may as well publish it; it might be of some use.

import re

line_pattern = re.compile(
    r"jarid: (?P<jarid>[a-z0-9\-]+) (?P<action>[a-z]+): (?P<status>[0-9]+) timestamp: (?P<ts>[0-9\:,]+)"
)

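infile = open('/path/to/file.log')
# Note: match() returns None for a line that does not fit the pattern,
# so this assumes every line in the file is a well-formed entry.
entries = (line_pattern.match(line).groupdict() for line in infile)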
events = {}

for entry in entries:
    event = events.setdefault(entry['jarid'], {})
    event[entry['action']] = entry['ts']

for jarid, event in events.iteritems():
    ack_event = 'ack' if 'ack' in event else 'nack' if 'nack' in event else None
    print 'jarid: %s recv: %s %s: %s' % (jarid, event.get('recv'), ack_event, event.get(ack_event))
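
If the real file might contain blank or otherwise unexpected lines, one possible variation is to skip anything the pattern does not match instead of letting `.groupdict()` fail on `None`. This is only a Python 3 sketch using the same pattern; `collect_events` and the sample lines are hypothetical, not part of the answer above.

```python
import re

line_pattern = re.compile(
    r"jarid: (?P<jarid>[a-z0-9\-]+) (?P<action>[a-z]+): (?P<status>[0-9]+) "
    r"timestamp: (?P<ts>[0-9:,]+)"
)


def collect_events(lines):
    """Return {jarid: {action: timestamp}}, ignoring lines that do not match."""
    events = {}
    for line in lines:
        match = line_pattern.match(line)
        if match is None:
            # Blank line, header, truncated entry, etc.: skip it.
            continue
        entry = match.groupdict()
        events.setdefault(entry["jarid"], {})[entry["action"]] = entry["ts"]
    return events


sample = [
    "jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 recv: 1 timestamp: 00:00:02,217\n",
    "\n",
    "this line is not a log entry\n",
    "jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 ack: 13 timestamp: 00:00:04,537\n",
]

for jarid, event in collect_events(sample).items():
    print("jarid:", jarid, " ".join("%s: %s" % item for item in event.items()))
```

Whether silently skipping is right depends on the log; counting or reporting the rejected lines may be safer if malformed entries are not expected.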

Upvotes: 0

John La Rooy

Reputation: 304205

This is based on JBernardo's answer, but uses defaultdict instead of setdefault. You can print it exactly the same way, so I won't copy that code here.

from collections import defaultdict
log = ['jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 recv: 1 timestamp: 00:00:02,217',
       'jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 ack: 13 timestamp: 00:00:04,537',
       'jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 recv: 1 timestamp: 00:00:08,018',
       'jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 nack: 14 timestamp: 00:00:10,338']

d = defaultdict(dict)
for i in (line.split() for line in log):
    d[i[1]][i[2]] = i[-1]

You can also unpack into meaningful names, for example:

for label1, jarid, jartype, x, label2, timestamp in (line.split() for line in log):
    d[jarid][jartype] = timestamp
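
One thing worth knowing when using `defaultdict` this way: plain indexing with a key that is not there quietly creates an empty entry, while `in` and `.get()` do not. A small Python 3 sketch of that behaviour follows; the `unknown` id is made up, and copying into a plain `dict` afterwards is just one way to avoid the surprise.

```python
from collections import defaultdict

log = [
    "jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 recv: 1 timestamp: 00:00:02,217",
    "jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 ack: 13 timestamp: 00:00:04,537",
]

d = defaultdict(dict)
for _label1, jarid, jartype, _code, _label2, timestamp in (line.split() for line in log):
    d[jarid][jartype] = timestamp

# Copy into a plain dict once loading is done; later lookups on this copy
# raise KeyError for unknown ids instead of inserting them.
events = dict(d)

unknown = "00000000-0000-0000-0000-000000000000"  # made-up id, not in the log

print(unknown in d, d.get(unknown))  # False None: neither call inserts anything
d[unknown]                           # plain indexing inserts an empty dict
print(unknown in d)                  # True
print(unknown in events)             # False: the plain copy is unaffected
```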

Upvotes: 2

Andrew Clark

Reputation: 208485

Here is a regex solution:

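import re
# `log` is assumed to hold the whole log file as a single string,
# e.g. the result of calling .read() on the open file.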
pattern = re.compile(r"""jarid:\s(\S+)       # save jarid to group 1
                         \s(recv:)\s\d+      # save 'recv:' to group 2
                         \stimestamp:\s(\S+) # save recv timestamp to group 3
                         .*?jarid:\s\1       # make sure next line has same jarid
                         \s(n?ack:)\s\d+     # save 'ack:' or 'nack:' to group 4
                         \stimestamp:\s(\S+) # save ack timestamp to group 5
                     """, re.VERBOSE | re.DOTALL | re.MULTILINE)

for content in pattern.finditer(log):
    print "    jarid: " + " ".join(content.groups())
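
To try this out, `log` has to be one string containing the whole file rather than a list of lines. A rough Python 3 sketch with the same pattern is below; the sample text is built by joining the four lines from the question, and the `results` list is only there for illustration. As far as I can tell, the lazy `.*?` will also pair a recv with an ack several lines later, but because matches do not overlap, interleaved jarids (A recv, B recv, A ack, B ack) would yield only the first pair, so this fits best when each recv is directly followed by its ack or nack.

```python
import re

pattern = re.compile(r"""jarid:\s(\S+)       # save jarid to group 1
                         \s(recv:)\s\d+      # save 'recv:' to group 2
                         \stimestamp:\s(\S+) # save recv timestamp to group 3
                         .*?jarid:\s\1       # a later line with the same jarid
                         \s(n?ack:)\s\d+     # save 'ack:' or 'nack:' to group 4
                         \stimestamp:\s(\S+) # save ack timestamp to group 5
                     """, re.VERBOSE | re.DOTALL | re.MULTILINE)

# The whole log as ONE string (what open(path).read() would return).
log = "\n".join([
    "jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 recv: 1 timestamp: 00:00:02,217",
    "jarid: 7e5ae720-9151-11e0-eff2-00238bce4216 ack: 13 timestamp: 00:00:04,537",
    "jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 recv: 1 timestamp: 00:00:08,018",
    "jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 nack: 14 timestamp: 00:00:10,338",
])

results = ["jarid: " + " ".join(m.groups()) for m in pattern.finditer(log)]
for line in results:
    print(line)
```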

Upvotes: 0

Bryan

Reputation: 6699

I wouldn't make status a dictionary. Instead, I would just store the timestamp under each status key in your jarid dictionary. It is better explained with an example:

def search_jarids(jarid):
    stored_jarid = storage[jarid]
    entry = "jarid: %s" % jarid
    for status in stored_jarid:
        entry += " %s: %s" % (status, stored_jarid[status])
    return entry

with open("yourlog.log", 'r') as log:
    lines = log.readlines()

storage = {}

for line in lines:
    line = line.strip()
    jarid_tag, jarid, status_tag, status, timestamp_tag, timestamp = line.split(" ")

    if jarid not in storage:
        storage[jarid] = {}

    status_tag = status_tag[:-1]
    storage[jarid][status_tag] = timestamp

print search_jarids("462c6d11-9151-11e0-a72c-00238bbdc9e7")

This would give you:

jarid: 462c6d11-9151-11e0-a72c-00238bbdc9e7 nack: 00:00:10,338 recv: 00:00:08,018

Hope it gets you started.
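
Note that the sample output above lists nack before recv, because iterating a dictionary in Python 2 gives no particular order. If recv should always come first, as in the question's desired output, one option is to walk a fixed list of status names instead of the dictionary itself. A minimal sketch that runs on Python 3, with `format_in_order` and `STATUS_ORDER` as made-up names and the same `storage` layout as above:

```python
STATUS_ORDER = ("recv", "ack", "nack")  # assumed set of statuses, in print order


def format_in_order(jarid, statuses):
    """Format one jarid with recv first, then ack or nack if present."""
    entry = "jarid: %s" % jarid
    for status in STATUS_ORDER:
        if status in statuses:
            entry += " %s: %s" % (status, statuses[status])
    return entry


storage = {
    "462c6d11-9151-11e0-a72c-00238bbdc9e7": {
        "nack": "00:00:10,338",
        "recv": "00:00:08,018",
    },
}

jarid = "462c6d11-9151-11e0-a72c-00238bbdc9e7"
print(format_in_order(jarid, storage[jarid]))
```

Any status not listed in `STATUS_ORDER` would be dropped silently, so the tuple would need extending if the log contains other event types.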

Upvotes: 0
