Reputation: 101
I'm trying to do the following in Python, with some bash scripting mixed in, unless there is an easier way to do it all in Python.
I have a log file with data that looks like the following:
16:14:59.027003 - WARN - Cancel Latency: 100ms - OrderId: 311yrsbj - On Venue: ABCD
16:14:59.027010 - WARN - Ack Latency: 25ms - OrderId: 311yrsbl - On Venue: EFGH
16:14:59.027201 - WARN - Ack Latency: 22ms - OrderId: 311yrsbn - On Venue: IJKL
16:14:59.027235 - WARN - Cancel Latency: 137ms - OrderId: 311yrsbp - On Venue: MNOP
16:14:59.027256 - WARN - Cancel Latency: 220ms - OrderId: 311yrsbr - On Venue: QRST
16:14:59.027293 - WARN - Ack Latency: 142ms - OrderId: 311yrsbt - On Venue: UVWX
16:14:59.027329 - WARN - Cancel Latency: 134ms - OrderId: 311yrsbv - On Venue: YZ
16:14:59.027359 - WARN - Ack Latency: 75ms - OrderId: 311yrsbx - On Venue: ABCD
16:14:59.027401 - WARN - Cancel Latency: 66ms - OrderId: 311yrsbz - On Venue: ABCD
16:14:59.027426 - WARN - Cancel Latency: 212ms - OrderId: 311yrsc1 - On Venue: EFGH
16:14:59.027470 - WARN - Cancel Latency: 89ms - OrderId: 311yrsf7 - On Venue: IJKL
16:14:59.027495 - WARN - Cancel Latency: 97ms - OrderId: 311yrsay - On Venue: IJKL
I need to extract the last field from each line, then take each unique value, find every line it appears in, and export the results to a .csv file.
I've used the following bash script to get each unique entry:
cat LogFile_$(date +%Y%m%d).msg.log | awk '{print $14}' | sort | uniq
Based on the above data in the log file, the bash script would return the following results:
ABCD
EFGH
IJKL
MNOP
QRST
UVWX
YZ
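For reference, the same unique-value extraction in pure Python could be sketched roughly like this (the filename is built the same way the date command above does):
from datetime import date

logname = 'LogFile_{0}.msg.log'.format(date.today().strftime('%Y%m%d'))

venues = set()
with open(logname) as logf:
    for line in logf:
        fields = line.split()
        if fields:                  # skip blank lines
            venues.add(fields[-1])  # the venue is the last whitespace-separated field

for venue in sorted(venues):
    print(venue)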
Now I would like to search (grep) for each of those results in the same log file and return the top ten results for each. I have another bash script that does this for a single value, but how do I do this using a for loop? So, for x, where x is each entry above:
grep x LogFile_$(date +%Y%m%d).msg.log | awk '{print $7}' | sort -nr | uniq | head -10
Then write the results to a .csv file. The results would look like this (each field in a separate column):
Column-A Column-B Column-C Column-D
ABCD 2sxrb6ab Cancel 46ms
ABCD 2sxrb6af Cancel 45ms
ABCD 2sxrb6i2 Cancel 63ms
ABCD 2sxrb6i3 Cancel 103ms
EFGH 2sxrb6i4 Cancel 60ms
EFGH 2sxrb6i7 Cancel 60ms
IJKL 2sxrb6ie Ack 74ms
IJKL 2sxrb6if Ack 74ms
IJKL 2sxrb76s Cancel 46ms
MNOP vcxrqrs5 Cancel 7651ms
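Building on the Python sketch above (logname and venues as defined there), the per-venue top ten could then be pulled out roughly like this; it mirrors the awk '{print $7}' | sort -nr | uniq | head -10 pipeline rather than the full four-column output:
with open(logname) as logf:
    lines = logf.readlines()

for venue in sorted(venues):
    # distinct latencies for this venue (field 7), ranked by numeric value, largest first
    latencies = {f[6] for f in (l.split() for l in lines) if f and f[-1] == venue}
    top_ten = sorted(latencies, key=lambda s: int(s.rstrip('ms')), reverse=True)[:10]
    print(venue, top_ten)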
I'm a beginner in Python and haven't done much coding since college (13 years ago). Any help would be greatly appreciated. Thanks.
Upvotes: 1
Views: 2814
Reputation: 1367
Say you've opened your file. What you want to do is record every timing that appears for each individual entry, which is to say, each entry will map to one or more timings:
from collections import defaultdict
entries = defaultdict(list)
for line in your_file:
    # Parse the line and return the 'ABCD' part and time
    column_a, timing = parse(line)
    entries[column_a].append(timing)
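Here parse is just a placeholder. Based on the line format in the question, a minimal version of it might look like this (it assumes every line has the same whitespace-separated layout):
def parse(line):
    # '16:14:59.027003 - WARN - Cancel Latency: 100ms - OrderId: 311yrsbj - On Venue: ABCD'
    fields = line.split()
    return fields[-1], fields[6]  # venue (last field) and latency (7th field, e.g. '100ms')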
When you're done, you have a dictionary like so:
{'ABCD': ['30ms', '25ms', '12ms'],
 'EFGH': ['12ms'],
 'IJKL': ['2ms', '14ms']}
What you'll want to do now is transform this dictionary into another data structure ordered by the len of its value (which is a list). Example:
In [15]: sorted(((k, v) for k, v in entries.items()),
key=lambda i: len(i[1]), reverse=True)
Out[15]:
[('ABCD', ['30ms', '25ms', '12ms']),
('IJKL', ['2ms', '14ms']),
('EFGH', ['12ms'])]
Of course this is only illustrative and you might want to collect some more data in the original for loop.
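To get from there to the .csv the question asks for, a minimal sketch could look like the following; it assumes the loop above was extended so that entries maps each venue to (order id, latency type, timing) tuples, and the output filename is just an example:
import csv

# Assumes entries: venue -> list of (order_id, latency_type, timing) tuples
with open('latency.csv', 'w') as out:
    writer = csv.writer(out)
    for venue, rows in sorted(entries.items(), key=lambda i: len(i[1]), reverse=True):
        for order_id, latency_type, timing in rows[:10]:  # at most ten rows per venue
            writer.writerow([venue, order_id, latency_type, timing])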
Upvotes: 1
Reputation: 1
Maybe not as concise as you might like, but I think this can solve your problem. I added some try...except handling to better cope with real data.
import re
import os
import csv
import collections
# get all log files under the current directory; the pattern could be more
# sophisticated, but that's not the focus here
log_pattern = re.compile(r"LogFile_[0-9]{8}\.msg\.log")
logfiles = [f for f in os.listdir('./') if log_pattern.match(f)]
# top n
nhead = 10
# used to parse useful fields
extract_pattern = re.compile(
    r'.*(Ack|Cancel) Latency: ([0-9]+ms) - OrderId: ([0-9a-z]+) - On Venue: ([A-Z]+)')
# container for final results
res = collections.defaultdict(list)
# parse out all interesting fields
for logfile in logfiles:
    with open(logfile, 'r') as logf:
        for line in logf:
            try:  # in case of a blank line or a line without these fields
                ltype, latency, orderid, venue = extract_pattern.match(line).groups()
            except AttributeError:
                continue
            res[venue].append((orderid, ltype, latency))
# write to csv
with open('res.csv', 'w') as resf:
    resc = csv.writer(resf, delimiter=' ')
    for venue in sorted(res):  # sort by Venue
        entries = res[venue]
        entries.sort()  # sort by OrderId
        for orderid, ltype, latency in entries[:nhead]:  # at most nhead rows per venue
            resc.writerow([venue, orderid, ltype, latency])
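If you'd rather rank each venue's entries by latency (mirroring the sort -nr | head -10 in the question) instead of by OrderId, the sort line could be swapped for something like:
# largest numeric latency first, e.g. '220ms' before '97ms'
entries.sort(key=lambda e: int(e[2].rstrip('ms')), reverse=True)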
Upvotes: 0