Reputation: 507
I have a small (40mb) server log, linked here
I have a regular expression which I'm using to parse the log, and it takes an incredibly long time (5+ minutes) to get through it. I'm relatively new to regex, so I'm not sure why it would take so long on such a small file
here's the expression:
valid=re.findall(r'(\d+/[a-zA-Z]+/\d+).*?(GET|POST)\s+(http://|https//)([a-zA-Z]+.+?)\.[^/].*?\.([a-zA-Z]+)(/|\s|:).*?\s200\s', line)
things really started to chug when I added the "200" at the end of the expression
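For reference, here is a minimal probe that times the pattern against a single line (sample_line is just the first line of the log, so the numbers are only indicative):

import re, timeit

pattern = re.compile(r'(\d+/[a-zA-Z]+/\d+).*?(GET|POST)\s+(http://|https//)([a-zA-Z]+.+?)\.[^/].*?\.([a-zA-Z]+)(/|\s|:).*?\s200\s')
sample_line = open("access_log.txt").readline()
# time 100 attempts on one line to get a feel for the per-line cost
print(timeit.timeit(lambda: pattern.findall(sample_line), number=100))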
and here's the entire code:
import re
#todo
#specify toplevel domain lookback
######
fhandle=open("access_log.txt", "rU")
access_log=fhandle.readlines()
validfile=open("valid3.txt", "w")
invalidfile=open("invalid3.txt", "w")
valid_dict=dict()
invalid_list=list()
valid_list=list()
#part 1
#read file. apply regex and append into internal data structure (a 2d dictionary)
for line in access_log:
    valid=re.findall(r'(\d+/[a-zA-Z]+/\d+).*?(GET|POST)\s+(http://|https//)([a-zA-Z]+.+?)\.[^/].*?\.([a-zA-Z]+)(/|\s|:).*?\s200\s', line)
    #valid=re.findall(r'(\d+/[a-zA-Z]+/\d+).*?(GET|POST)\s+(http://|https://)([a-zA-Z]+.+?)\.[^/].*?\.([a-zA-Z]+)(/|\s|:).*?\s+200\s', line)
    if valid:
        date=valid[0][0]
        domain=valid[0][4].lower()
        valid_list.append(line)
        #writes results into 2d dictionary (dictionary of dictionaries)
        if date not in valid_dict:
            valid_dict[date]={}
        if domain in valid_dict[date]:
            valid_dict[date][domain]+=1
        else:
            valid_dict[date][domain]=1
    #writes remaining lines into the invalid log list
    else:
        invalid_list.append(line)
#step 2
#format output file for tsv
#ordered chronologically, with Key:Value pairs organized alphabetically by key (Domain Name)
date_workspace=''
domain_workspace=''
for date in sorted(valid_dict.iterkeys()):
    date_workspace+=date + "\t"
    for domain_name in sorted(valid_dict[date].iterkeys()):
        domain_workspace+="%s:%s\t" % (domain_name, valid_dict[date][domain_name])
    date_workspace+=domain_workspace
    date_workspace+="\n"
    domain_workspace=''
# Step 3
# write output
validfile.write(date_workspace)
for line in invalid_list:
    invalidfile.write(line)
fhandle.close()
validfile.close()
invalidfile.close()
Upvotes: 0
Views: 76
Reputation: 89557
Assuming that you want to keep the domain name extension, you can change the regex part of your code like this:
pattern = re.compile(r'^[^[]+\[(\d+/[a-zA-Z]+/\d+)[^]]+] "(?:GET|POST) https?://[a-zA-Z]+[^?/\s]*\.([a-zA-Z]+)[?/ :][^"]*" 200 ')
for line in access_log:
    valid=pattern.search(line)
    if valid:
        date=valid.group(1)
        domain=valid.group(2).lower()
        valid_list.append(line)
Improvements: 5min -> 2s
Since you read the file line by line, there is only one possible match per line, so it is better to use re.search, which returns the first match, instead of re.findall.
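A minimal illustration of the difference, with a made-up pattern rather than the real one:

import re

line = 'x "GET http://www.example.com/ HTTP/1.1" 200 1234'
pattern = re.compile(r'(GET|POST) (\S+)')
print(pattern.findall(line))           # [('GET', 'http://www.example.com/')] -- scans the whole line, builds a list
print(pattern.search(line).group(2))   # http://www.example.com/ -- stops at the first match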
The pattern is used once per line, which is why I have chosen to compile it before the loop.
The pattern is now anchored with the start-of-string anchor ^, and the beginning of the line is now described with [^[]+\[ (all that is not a [, one or more times, followed by a [). This improvement is very important, since it prevents the regex engine from trying the start of the pattern at each character of the line.
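For example, on a sample line in common log format (an assumption about how your file is laid out), the [^[]+\[ prefix jumps straight to the bracketed date:

import re

line = '203.0.113.9 - - [10/Oct/2016:13:55:36 -0700] "GET http://www.example.com/index.html HTTP/1.1" 200 2326'
m = re.search(r'^[^[]+\[(\d+/[a-zA-Z]+/\d+)', line)
print(m.group(1))  # 10/Oct/2016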
All .*? are slow for two reasons (at least): a lazy quantifier must test whether the following subpattern matches at each character, and if the pattern fails later, since .*? can match any character, the regex engine doesn't have the smallest reason to stop its backtracking. In other words, the good way is to be as explicit as possible. To do that you must replace all .*? with a negated character class and a greedy quantifier.
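A rough micro-benchmark of that difference, on a made-up line that never matches (illustrative patterns, not the ones above):

import re, timeit

line = '"a' * 500 + '" 999'               # many quotes, and neither pattern will ever match
lazy = re.compile(r'".*?" 404')           # .*? crawls to the end of the line before giving up
explicit = re.compile(r'"[^"]*" 404')     # [^"]* stops at the next quote and fails fast

print(timeit.timeit(lambda: lazy.search(line), number=100))      # much slower
print(timeit.timeit(lambda: explicit.search(line), number=100))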
All unneeded capturing groups have been replaced with a non-capturing group (?:...).
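A tiny illustration: a non-capturing group matches as usual but does not take a group number, so (\S+) below is still group 1:

import re

m = re.search(r'(?:GET|POST) (\S+)', 'GET /index.html HTTP/1.1')
print(m.group(1))  # /index.html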
Some other trivial changes have been made, like (http://|https://) => https?:// or (/|\s|:) => [?/ :]. All \s+ have been replaced with a literal space.
As an aside, I am sure there are plenty of log parsers/analysers for Python that can help you. Note too that your log file uses a CSV-like format.
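For instance, a minimal sketch with the standard csv module (assuming the log is in common log format: space-delimited fields with a quoted request string):

import csv

with open("access_log.txt") as fh:
    for row in csv.reader(fh, delimiter=' ', quotechar='"'):
        # assumed layout: host, ident, user, [date, offset], request, status, size
        request, status = row[5], row[6]
        if status == '200':
            print(request)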
Upvotes: 1