Reputation: 13062
I have a log file that I need to parse to find whether a certain event is followed by another related event. Essentially, I want to know whether the first event is alone or has an associated pair event. For example, the data is of the form:
Timestamp Event Property1 Property2 Property3
1445210282416 E1 A 1 Type1 *
1445210282434 F1 D 3 Type10
1445210282490 E1 C 5 Type2
1445210282539 E2 A 1 Type1 *
1445210282943 F1 D 1 Type15
1445210285452 E2 C 4 Type3
This is a simplified example but is essentially the same as the data file. We are trying to find whether an event E1 has a corresponding event E2 for which Property1, Property2 and Property3 are all equal, as in the two events marked with *. The second E1 event (row 3) doesn't have a corresponding E2 event. I also need to keep a count of such unpaired events, keyed by Property3, for later use.
The files can be quite large (around 1 GB), so I should avoid having the whole file in memory at the same time. So, I figured I could use a generator.
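To illustrate the generator idea, here is a minimal sketch that yields one row at a time as a dict, so only the current line is held in memory. The name iter_rows and the assumption that columns are separated by whitespace are illustrative and not taken from the real log format.

```python
import io

def iter_rows(lines):
    """Lazily yield one dict per data line; `lines` is any iterable of strings."""
    lines = iter(lines)
    header = next(lines, "").split()      # e.g. Timestamp, Event, Property1, ...
    if not header:                        # empty input: nothing to yield
        return
    for line in lines:
        fields = line.split()
        if len(fields) < len(header):     # skip blank or malformed lines
            continue
        # zip stops at the header length, so a trailing marker such as "*" is ignored
        yield dict(zip(header, fields))

# Hypothetical in-memory sample; for a real file, pass the open file object instead.
sample = io.StringIO(
    "Timestamp Event Property1 Property2 Property3\n"
    "1445210282416 E1 A 1 Type1 *\n"
    "1445210282434 F1 D 3 Type10\n"
)
rows = list(iter_rows(sample))
```

Because an open file object is itself an iterable of lines, `for row in iter_rows(f)` would process a large file without loading it whole.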
An initial attempt is:
import csv

with open(filename, newline='') as f:  # text mode; the csv module handles newlines
    finding_pair = 0   # indicator to help determine what to do in a line of the file
    e1 = {}            # store the E1 row whose pair we want to find
    without_pair = {}  # store count of E1 events with no pair, keyed by Property3
    reader = csv.DictReader(f, delimiter=' ')
    for row in reader:
        if row['Event'] == 'E1' and finding_pair == 0:  # find pair for this
            # Go through file after this line to find E2 event.
            e1 = row
            finding_pair = 1
        elif row['Event'] in ('E1', 'F1') and finding_pair == 1:  # skip this and keep finding pair
            continue
        elif row['Event'] == 'E2' and finding_pair == 1:  # see if this is a pair
            if (row['Property1'] == e1['Property1']
                    and row['Property2'] == e1['Property2']
                    and row['Property3'] == e1['Property3']):
                # pair found
                finding_pair = 0
                # Go to next E1 line ??
            else:
                # pair not found
                key = e1['Property3']
                without_pair[key] = without_pair.get(key, 0) + 1
                # Go to next E1 line ??
So, my questions are:
Upvotes: 1
Views: 408
Reputation: 58627
Solution in TXR
The script, created by copying data to pair.txr and editing it to add the extraction and output directives:
$ cat pair.txr
Timestamp Event Property1 Property2 Property3
@ts1 E1 @p1 @p2 @p3
@(skip)
@(line ln)
@ts2 @e2 @p1 @p2 @p3
@(output)
Duplicate of E1 found at line @ln: event @e2 timestamp @ts2.
@(end)
Run:
$ txr pair.txr data
Duplicate of E1 found at line 5: event E2 timestamp 1445210282539.
Run on some nonmatching data:
$ txr pair.txr /etc/motd # failed termination status
$ echo $?
1
Data is:
$ cat data
Timestamp Event Property1 Property2 Property3
1445210282416 E1 A 1 Type1
1445210282434 F1 D 3 Type10
1445210282490 E1 C 5 Type2
1445210282539 E2 A 1 Type1
1445210282943 F1 D 1 Type15
1445210285452 E2 C 4 Type3
If it is a constraint that the second event must specifically have the name E2, then we can simply replace the e2 variable with the literal text E2.
If you know that the duplicate must occur within, say, 100 lines, you can use @(skip 100). That can prevent time from being wasted scanning large files where there is no duplicate. Of course, the 100 doesn't have to be constant; it can be computed. If there are multiple duplicates, @(skip :greedy) will find the last duplicate.
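For readers following along in Python, the language of the question, the bounded scan can be sketched roughly as follows. This is only an analogy for the idea of limiting the lookahead, not a description of how TXR implements skip; the name find_duplicate_within and the max_ahead parameter are hypothetical.

```python
from itertools import islice

def find_duplicate_within(rows, first, max_ahead=100):
    """Look at no more than `max_ahead` following rows for one whose three
    properties equal those of `first`. Return (offset, row) or None."""
    key = first[2:5]                       # Property1, Property2, Property3
    for offset, row in enumerate(islice(rows, max_ahead), start=1):
        if row[2:5] == key:
            return offset, row
    return None

# The sample data from this answer, without the header line.
data = [line.split() for line in """\
1445210282416 E1 A 1 Type1
1445210282434 F1 D 3 Type10
1445210282490 E1 C 5 Type2
1445210282539 E2 A 1 Type1
1445210282943 F1 D 1 Type15
1445210285452 E2 C 4 Type3
""".splitlines()]

rows = iter(data)
first = next(rows)                         # the E1 row to be paired
hit = find_duplicate_within(rows, first, max_ahead=100)
# With a window of only 2 rows the duplicate is out of reach.
miss = find_duplicate_within(iter(data[1:]), data[0], max_ahead=2)
```

Since islice consumes at most max_ahead items from the iterator, the scan stops early even when the underlying file is very large.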
Note that even though @(line ln) is on a line by itself, it has the semantics of not consuming a line. It binds the ln variable to the current line number in the input, but doesn't advance to the next line, so the subsequent line of the pattern language is applied to the same line. Thus ln indicates the line where that pattern matches.
Now, let's do something interesting: let's use variables for E1 and the second event. Moreover, let us not assume that the event to be matched is the first one:
Timestamp Event Property1 Property2 Property3
@(skip)
@ts1 @e1 @p1 @p2 @p3
@(skip)
@(line ln)
@ts2 @e2 @p1 @p2 @p3
@(output)
Duplicate of @e1 found at line @ln: event @e2 timestamp @ts2.
@(end)
As it stands, this code will now just find the first pair in the data:
$ txr pair.txr data
Duplicate of E1 found at line 5: event E2 timestamp 1445210282539.
What we can do now is constrain the variables from the command line like this:
# Is there an E1 followed by a duplicate?
$ txr -De1=E1 pair.txr data
Duplicate of E1 found at line 5: event E2 timestamp 1445210282539.
# Is there an E2 followed by a duplicate?
$ txr -De1=E2 pair.txr data
$ echo $?
1
# Is there some event which is followed by a dupe called E2?
$ txr -De2=E2 pair.txr data
Duplicate of E1 found at line 5: event E2 timestamp 1445210282539.
# Is there a pair of duplicates whose Property3 is Type1?
$ txr -Dp3=Type1 pair.txr data
Duplicate of E1 found at line 5: event E2 timestamp 1445210282539.
You get the picture.
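For comparison, the same kind of query can be approximated in Python, with optional keyword arguments playing the role of the -D bindings. This is a rough sketch under assumptions: find_pair and its parameters are hypothetical names, the rows are held in a list for simplicity (so it is suited to small inputs, not the 1 GB case), and it mimics only the behaviour shown in the runs above.

```python
def find_pair(rows, e1=None, e2=None, p3=None):
    """Return (first_row, line_number, second_row) for the earliest row that is
    followed by a row with the same Property1..3, or None if there is none.
    Any keyword left as None is unconstrained."""
    for i, a in enumerate(rows):
        if e1 is not None and a[1] != e1:
            continue
        if p3 is not None and a[4] != p3:
            continue
        for j in range(i + 1, len(rows)):
            b = rows[j]
            if b[2:5] == a[2:5] and (e2 is None or b[1] == e2):
                return a, j + 2, b         # +2: 1-based numbering plus the header line
    return None

# The sample data from this answer, without the header line.
data = [line.split() for line in """\
1445210282416 E1 A 1 Type1
1445210282434 F1 D 3 Type10
1445210282490 E1 C 5 Type2
1445210282539 E2 A 1 Type1
1445210282943 F1 D 1 Type15
1445210285452 E2 C 4 Type3
""".splitlines()]

unconstrained = find_pair(data)            # any event followed by a duplicate
from_e2 = find_pair(data, e1="E2")         # an E2 followed by a duplicate
to_e2 = find_pair(data, e2="E2")           # a duplicate that is called E2
type1 = find_pair(data, p3="Type1")        # a pair whose Property3 is Type1
```

The nested loop is quadratic in the worst case; it is meant to show the shape of the query rather than to be an efficient implementation.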
Upvotes: 1