sfactor

Reputation: 13062

Parsing log files to find related events in Python

I have a log file that I need to parse to find whether a certain event is followed by another related event, that is, whether the first event stands alone or has an associated pair event. For example, the data is of the form:

Timestamp         Event        Property1        Property2      Property3
1445210282416     E1             A               1               Type1   *
1445210282434     F1             D               3               Type10      
1445210282490     E1             C               5               Type2
1445210282539     E2             A               1               Type1   *
1445210282943     F1             D               1               Type15 
1445210285452     E2             C               4               Type3

This is a simplified example, but it is essentially the same as the data file. We are trying to find whether an event E1 has a corresponding event E2 for which Property1, Property2 and Property3 must all be equal, as in the two events marked with * above. The second E1 event (row 3) doesn't have a corresponding E2 event. I also need to keep a count of such unpaired events, keyed by Property3, for later use.

The files can be quite large (around 1 GB), so I should avoid having the whole file in memory at the same time. So I figured I could use a generator.
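A minimal sketch of the generator idea, assuming the columns are separated by runs of whitespace and the first line is the header. The name read_rows is hypothetical, and the in-memory sample only stands in for the real file; passing an open file object instead would read it one line at a time.

```python
import io

def read_rows(lines):
    """Yield one dict per data row, reading the input lazily."""
    lines = iter(lines)
    header = next(lines).split()            # column names from the first line
    for line in lines:
        fields = line.split()               # split on any run of whitespace
        if len(fields) >= len(header):      # skip blank or short lines
            # zip stops at the header length, so extra trailing
            # fields (such as a '*' marker) are dropped.
            yield dict(zip(header, fields))

# Small in-memory sample standing in for the real log file.
sample = io.StringIO(
    "Timestamp Event Property1 Property2 Property3\n"
    "1445210282416 E1 A 1 Type1\n"
    "1445210282434 F1 D 3 Type10\n"
)
rows = list(read_rows(sample))
print(rows[0]["Event"], rows[1]["Property3"])   # prints: E1 Type10
```

Only the current line is held in memory while iterating; the list() call here is just for the demonstration.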

An initial attempt is:

import csv

filename = 'data'  # placeholder path to the log file

with open(filename, newline='') as f:
    finding_pair = False  # True while we are looking for the E2 that pairs with e1
    e1 = {}               # the E1 row whose pair we want to find
    without_pair = {}     # count of E1 events with no pair, keyed by Property3

    # Columns are separated by runs of spaces, so skip the repeated spaces.
    reader = csv.DictReader(f, delimiter=' ', skipinitialspace=True)

    for row in reader:
        if row['Event'] == 'E1' and not finding_pair:
            # Find the pair for this row: go through the file after this line to find an E2 event.
            e1 = row
            finding_pair = True
        elif row['Event'] in ('E1', 'F1') and finding_pair:
            # Skip this row and keep looking for the pair.
            continue
        elif row['Event'] == 'E2' and finding_pair:
            if all(row[k] == e1[k] for k in ('Property1', 'Property2', 'Property3')):
                # Pair found.
                finding_pair = False
                # Go to next E1 line ??
            else:
                # Pair not found.
                without_pair[e1['Property3']] = without_pair.get(e1['Property3'], 0) + 1
                # Go to next E1 line ??

So, my questions are:

Upvotes: 1

Views: 408

Answers (1)

Kaz

Reputation: 58627

Solution in TXR

The script is made by copying the data to pair.txr and editing it to add the extraction and output directives.

$ cat pair.txr
Timestamp         Event        Property1        Property2      Property3
@ts1 E1 @p1 @p2 @p3
@(skip)
@(line ln)
@ts2 @e2 @p1 @p2 @p3
@(output)
Duplicate of E1 found at line @ln: event @e2 timestamp @ts2.
@(end)

Run:

$ txr pair.txr data
Duplicate of E1 found at line 5: event E2 timestamp 1445210282539.

Run on some nonmatching data:

$ txr pair.txr /etc/motd   # failed termination status
$ echo $?
1

Data is:

$ cat data
Timestamp         Event        Property1        Property2      Property3
1445210282416     E1             A               1               Type1
1445210282434     F1             D               3               Type10
1445210282490     E1             C               5               Type2
1445210282539     E2             A               1               Type1
1445210282943     F1             D               1               Type15
1445210285452     E2             C               4               Type3

If it is a constraint that the second event must specifically have the name E2, then we can simply replace the e2 variable with the literal text E2.

If you know that the duplicate must occur within, say 100 lines, you can use @(skip 100). That can prevent time from being wasted scanning large files where there is no duplicate. Of course, the 100 doesn't have to be constant; it can be computed. If there are multiple duplicates, @(skip :greedy) will find the last duplicate.
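For comparison with the Python attempt in the question, the bounded-search idea can be sketched roughly as follows. This is not a translation of the TXR script: it assumes every E1 is tracked at once (rather than one at a time), that a pair must appear within window rows of its E1, and the names unpaired_within and make_rows are hypothetical.

```python
from collections import Counter, deque

KEY = ("Property1", "Property2", "Property3")

def unpaired_within(rows, window):
    """Yield each E1 row with no matching E2 in the next `window` rows."""
    pending = deque()   # (index, key, row) of E1 rows still waiting, oldest first
    for i, row in enumerate(rows):
        # E1 rows more than `window` rows back can no longer be paired.
        while pending and i - pending[0][0] > window:
            yield pending.popleft()[2]
        key = tuple(row[k] for k in KEY)
        if row["Event"] == "E1":
            pending.append((i, key, row))
        elif row["Event"] == "E2":
            for item in pending:
                if item[1] == key:
                    pending.remove(item)    # pair found; stop tracking this E1
                    break                   # leave the loop right after mutating
    for _, _, row in pending:               # input ended before a pair showed up
        yield row

def make_rows(text):
    """Turn whitespace-separated text with a header line into a list of dicts."""
    header, *body = text.strip().splitlines()
    names = header.split()
    return [dict(zip(names, line.split())) for line in body]

DATA = """\
Timestamp Event Property1 Property2 Property3
1445210282416 E1 A 1 Type1
1445210282434 F1 D 3 Type10
1445210282490 E1 C 5 Type2
1445210282539 E2 A 1 Type1
1445210282943 F1 D 1 Type15
1445210285452 E2 C 4 Type3
"""
rows = make_rows(DATA)
counts = Counter(r["Property3"] for r in unpaired_within(rows, window=100))
print(counts)   # prints: Counter({'Type2': 1})
```

Memory use is bounded by the window size, since at most that many E1 rows are pending at any time. The Counter gives the per-Property3 tally of unpaired events that the question asks for.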

Note that even though @(line ln) is on a line by itself, it has the semantics of not consuming a line. It binds the ln variable to the current line number in the input, but doesn't advance to the next line so that the subsequent line of the pattern language is applied to the same line. Thus ln indicates the line where that pattern matches.

Now, let's do something interesting: let's use variables for E1 and the second event. Moreover, let us not assume that the event to be matched is the first one:

Timestamp         Event        Property1        Property2      Property3
@(skip)
@ts1 @e1 @p1 @p2 @p3
@(skip)
@(line ln)
@ts2 @e2 @p1 @p2 @p3
@(output)
Duplicate of @e1 found at line @ln: event @e2 timestamp @ts2.
@(end)

As it stands, this code will now just find the first pair in the data:

$ txr pair.txr data
Duplicate of E1 found at line 5: event E2 timestamp 1445210282539.

What we can do now is constrain the variables from the command line like this:

# Is there an E1 followed by a duplicate?
$ txr -De1=E1 pair.txr data 
Duplicate of E1 found at line 5: event E2 timestamp 1445210282539.

# Is there an E2 followed by a duplicate?
$ txr -De1=E2 pair.txr data 
$ echo $?
1

# Is there some event which is followed by a dupe called E2?
$ txr -De2=E2 pair.txr data 
Duplicate of E1 found at line 5: event E2 timestamp 1445210282539.

# Is there a pair of duplicates whose Property3 is Type1?
$ txr -Dp3=Type1 pair.txr data
Duplicate of E1 found at line 5: event E2 timestamp 1445210282539.

You get the picture.
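For readers following along in Python, the effect of the command-line constraints above can be approximated by optional keyword arguments. This is only a sketch of the same search, not how TXR works internally; the function name first_duplicate and its parameters are hypothetical, and it assumes the rows fit in a list.

```python
KEY = ("Property1", "Property2", "Property3")

def first_duplicate(rows, e1=None, e2=None, p3=None):
    """Return (first_event, line_no, second_event, timestamp) or None.

    A constraint left as None matches anything, like an unbound variable.
    """
    for i, first in enumerate(rows):
        if e1 is not None and first["Event"] != e1:
            continue
        if p3 is not None and first["Property3"] != p3:
            continue
        key = tuple(first[k] for k in KEY)
        for j in range(i + 1, len(rows)):
            second = rows[j]
            if e2 is not None and second["Event"] != e2:
                continue
            if tuple(second[k] for k in KEY) == key:
                # The header is line 1, so rows[j] sits on line j + 2.
                return first["Event"], j + 2, second["Event"], second["Timestamp"]
    return None

NAMES = ("Timestamp", "Event") + KEY
rows = [dict(zip(NAMES, line.split())) for line in (
    "1445210282416 E1 A 1 Type1",
    "1445210282434 F1 D 3 Type10",
    "1445210282490 E1 C 5 Type2",
    "1445210282539 E2 A 1 Type1",
    "1445210282943 F1 D 1 Type15",
    "1445210285452 E2 C 4 Type3",
)]

print(first_duplicate(rows))             # ('E1', 5, 'E2', '1445210282539')
print(first_duplicate(rows, e1="E2"))    # None
print(first_duplicate(rows, p3="Type1")) # ('E1', 5, 'E2', '1445210282539')
```

The nested loop is quadratic in the worst case and needs random access to the rows, so it suits small inputs or a bounded window rather than a 1 GB file.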

Upvotes: 1
