Reputation: 4188
I am writing a Python script to parse LDAP logs and count the number of searches and binds by each user. I tested my code on sample files; for smaller files of up to 5-10 MB it runs quickly and completes within 1 minute on my local PC. However, when I ran the script on an 18 MB file with around 150,000 lines, it took around 5 minutes. I want to run this script on files of about 100 MB, with maybe 5-6 files in each run, so the script has to parse roughly 600-700 MB of data per run. I expect that would take a long time, so I would appreciate advice on whether the code below can be tuned for better execution time.
import os,re,datetime
from collections import defaultdict
d=defaultdict(list)   # user -> list of connection ids seen in BIND requests
k=defaultdict(list)   # user -> list of (connection id, search count)
start_time=datetime.datetime.now()
fh = open("C:\\Rohit\\ECD Utilization Script - Copy\\logdir\\access","r").read()
pat=re.compile(r' BIND REQ .*conn=([\d]*).*dn=(.*")')
srchStr='\n'.join(re.findall(r' SEARCH REQ .*',fh))
bindlist=re.findall(pat,fh)
for entry in bindlist:
    d[entry[-1].split(",")[0]].append(entry[0])
for key in d:
    for con in d[key]:
        count = re.findall(con,srchStr)
        k[key].append((con,len(count)))
#
for key in k:
    print("Number of searches by ",key, " : ",sum([i[1] for i in k[key]]))
for key in d:
    print("No of bind ",key," = ",len(d[key]))
end_time=datetime.datetime.now()
print("Total time taken - {}".format(end_time-start_time))
Upvotes: 0
Views: 5994
Reputation: 4188
I was able to solve my problem with the code below, which reads the file line by line in a single pass.
import os,re,datetime
from collections import defaultdict
start_time=datetime.datetime.now()
bind_count=defaultdict(int)
search_conn=defaultdict(int)
bind_conn=defaultdict(str)
j=defaultdict(int)
fh = open("C:\\access","r")
total_searches=0
total_binds=0
for line in fh:
    reg1=re.search(r' BIND REQ .*conn=(\d+).*dn=(.*")', line)
    reg2=re.search(r' SEARCH REQ .*conn=(\d+).*', line)
    if reg1:
        total_binds+=1
        uid,con=reg1.group(2,1)
        bind_count[uid]=bind_count[uid]+1
        bind_conn[con]=uid
    if reg2:
        total_searches+=1
        skey=reg2.group(1)
        search_conn[skey] = search_conn[skey]+1
for conid in search_conn:
    if conid in bind_conn:
        new_key=bind_conn[conid]
        j[new_key]=j[new_key]+search_conn[conid]
for k,v in bind_count.items():
    print(k," = ",v)
print("*"*80)
for k,v in j.items():
    print(k,"-->",v)
fh.close()
del search_conn
del bind_conn
end_time=datetime.datetime.now()
print("Total time taken - {}".format(end_time-start_time))
Upvotes: 0
Reputation: 1952
You are scanning the entire file several times on this line:
count = re.findall('SEARCH REQ.*'+conid,fh1)
Avoid this; it is your major problem. Collect all the conids in a list first, then iterate over the file once more, with the inner loop running over the conids. In other words, move the file scan out of the outer loop, so that you make only two scans of the file.
Also, since this is plain Python, try running it with PyPy for faster runs.
You can do this better with an FSM (finite-state machine) and by spending a bit more RAM. This is only a hint; you will have to write the FSM yourself.
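The two-scan idea above could look roughly like the sketch below. It is only an illustration under stated assumptions: the sample log lines are hypothetical, invented to fit the regexes from the question (the real log format may differ), and the names `count_searches_two_pass`, `BIND_RE` and `SEARCH_RE` are made up for this example. It also swaps the inner loop over conids for a dictionary lookup, which is one possible way to keep the second scan cheap.

```python
import re
from collections import Counter

# Patterns adapted from the question; the dn capture stops at the first
# comma so that only the leading "uid=..." part is used as the key.
BIND_RE = re.compile(r' BIND REQ .*conn=(\d+).*dn="([^",]*)')
SEARCH_RE = re.compile(r' SEARCH REQ .*conn=(\d+)')


def count_searches_two_pass(lines):
    """Return a Counter {user: searches}; `lines` must be re-iterable."""
    # Scan 1: remember which user bound on each connection id.
    user_by_conn = {}
    for line in lines:
        m = BIND_RE.search(line)
        if m:
            user_by_conn[m.group(1)] = m.group(2)

    # Scan 2: credit every SEARCH line to the user of its connection
    # with a dict lookup, instead of re-scanning the text per conid.
    searches = Counter()
    for line in lines:
        m = SEARCH_RE.search(line)
        if m and m.group(1) in user_by_conn:
            searches[user_by_conn[m.group(1)]] += 1
    return searches


# Hypothetical log lines (assumed format, not taken from a real log).
sample = [
    '[t1] BIND REQ conn=101 op=0 dn="uid=alice,ou=people,dc=example,dc=com"',
    '[t2] BIND REQ conn=102 op=0 dn="uid=bob,ou=people,dc=example,dc=com"',
    '[t3] SEARCH REQ conn=101 op=1 base="dc=example,dc=com"',
    '[t4] SEARCH REQ conn=102 op=1 base="dc=example,dc=com"',
    '[t5] SEARCH REQ conn=101 op=2 base="dc=example,dc=com"',
    '[t6] SEARCH REQ conn=999 op=1 base="dc=example,dc=com"',  # no bind seen
]
result = count_searches_two_pass(sample)
for user, n in result.items():
    print(user, "-->", n)
```

With a real file, each scan needs its own iteration, for example by calling `f.seek(0)` between the two loops. If a connection id is reused by a later bind, this sketch credits all of its searches to the last user seen, which may or may not be what you want.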
Edit 1: This is the version of the script I wrote after seeing the log file. Please correct me if there is any mistake:
#!/usr/bin/env python
import sys
import re
def parse(filepath):
    d = {}  # uid -> {'bind_count': int, 'search_count': int}
    regex1 = re.compile(r'(.*)?BIND\sREQ(.*)uid=(\w+)')
    regex2 = re.compile(r'(.*)?SEARCH\sREQ(.*)uid=(\w+)')
    with open(filepath, 'r') as f:
        for l in f:
            m = re.search(regex1, l)
            if m:
                # print (m.group(3))
                uid = m.group(3)
                if uid in d:
                    d[uid]['bind_count'] += 1
                else:
                    d[uid] = {}
                    d[uid]['bind_count'] = 1
                    d[uid]['search_count'] = 0
            m = re.search(regex2, l)
            if m:
                # print (m.group(3))
                uid = m.group(3)
                if uid in d:
                    d[uid]['search_count'] += 1
                else:
                    d[uid] = {}
                    d[uid]['search_count'] = 1
                    d[uid]['bind_count'] = 0
    for k in d:
        print('user id = ' + k, 'Bind count = ' + str(d[k]['bind_count']), 'Search count = ' + str(d[k]['search_count']))
def process_args():
    # Compare the number of arguments, not the list itself
    if len(sys.argv) < 2:
        print('Usage: parse_ldap_log.py log_filepath')
        sys.exit(1)
if __name__ == '__main__':
    process_args()
    parse(sys.argv[1])
Thank the Gods that it was not complicated enough to warrant an FSM.
Upvotes: 1
Reputation: 2263
Your script has quadratic complexity: for each line in the file, you scan the data again to match the log entry. My suggestion is to read the file only once, counting the occurrences of the entries you need (the ones matching " BIND REQ ") as you go.
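As a rough sketch of that single-read counting, assuming each bind line carries the DN as `dn="..."` on the same line: the sample lines are hypothetical, and `count_binds` is a name invented for this example, not something from the original script.

```python
from collections import Counter


def count_binds(lines):
    """Count BIND requests per user in one pass over `lines`."""
    binds = Counter()
    for line in lines:
        # Cheap substring test first: most lines are not bind requests.
        if " BIND REQ " not in line:
            continue
        # Assumption: the DN is quoted as dn="..." on the same line.
        _, sep, rest = line.partition('dn="')
        if sep:
            dn = rest.split('"', 1)[0]
            # Key on the first DN component, as the question's code does.
            binds[dn.split(",")[0]] += 1
    return binds


# Hypothetical log lines (assumed format, not taken from a real log).
sample = [
    '[t1] BIND REQ conn=1 op=0 dn="uid=alice,dc=example,dc=com"',
    '[t2] SEARCH REQ conn=1 op=1 base="dc=example,dc=com"',
    '[t3] BIND REQ conn=2 op=0 dn="uid=alice,dc=example,dc=com"',
    '[t4] BIND REQ conn=3 op=0 dn="uid=bob,dc=example,dc=com"',
]
bind_counts = count_binds(sample)
for user, n in bind_counts.items():
    print(user, "=", n)
```

Because the function only iterates over `lines`, an open file object can be passed directly, so the file is never loaded into memory as a whole.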
Upvotes: 0