Reputation: 23
From a given log file, I need to find which URLs were not found (status 404). Sample data from the log file:
Entry 1:
443623565414391809 2014-09-02T14:09:36 2014-09-03T00:48:42Z 4147981 demo-workablehr 54.198.230.235 Local3 Info heroku/router at=info method=GET path="/api/accounts/3" host=workabledemo.com request_id=73ffd4fc-c86c-41ca-a737-91da110fbc39 fwd="50.31.164.139" dyno=web.2 connect=5ms service=17ms status=404 bytes=444
Entry 2:
443623565414391810 2014-09-02T14:10:27 2014-09-03T00:48:42Z 4147981 demo-workablehr 54.198.230.235 Local7 Info app/web.2 [e1af99e5-64b4-4228-8e23-d9b6bab84f80] [VISITOR #NEW] [GUEST] [1m[35mAccount Load (1.2ms)[0m SELECT "accounts".* FROM "accounts" WHERE (accounts.approval_status != 'blocked') AND "accounts"."id" = 3 LIMIT 1
In Entry 2, the character before [1m, [35m and [0m is ESC (these are ANSI color codes).
I understand that I need to open the file, read its contents, and look for status=404. How can I do this with Python 3? The file has 30,000+ entries.
I tried this:
count404 = 0
with open('C:\\Users\\Zee\\Downloads\\testLog.txt','r') as f:
    for line in f:
        for word in line.split():
            count404 += 1
print(count404)
Is there a better approach? And if I keep this one, how do I get the list of URLs that have status=404?
I am fairly new to Python and to StackOverflow. Thanks in advance.
Upvotes: 0
Views: 1282
Reputation: 458
As pointed out in the comments, regex is your best friend here. To count the lines with status 404, a sample approach is:
import re
count = 0
# iterate over the file line by line instead of loading it all with readlines()
with open('C:\\Users\\Zee\\Downloads\\testLog.txt', 'r') as fl:
    for line in fl:
        if re.search(r'status=404', line):
            count += 1
print(count)
To build a list of all the paths in the log that have status 404, we can use regex again:
import re
count = 0
lst = []
with open('C:\\Users\\Zee\\Downloads\\testLog.txt', 'r') as fl:
    for line in fl:
        if re.search(r'status=404', line):
            count += 1
            # search the current line (not the first line of the file) and
            # capture only the text between the quotes of path="..."
            match = re.search(r'path="([^"]*)"', line)
            if match:  # skip 404 lines that carry no path= field
                lst.append(match.group(1))
print(count)
print(lst)
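If the same URL shows up many times, it may be handier to count hits per path than to keep a flat list. Here is a minimal sketch of that idea, not the only way to do it. The names ROUTER_LINE and paths_with_status are made up for illustration, the sample lines are shortened versions of the router entry from the question (plus an invented status=200 line), and the pattern assumes path="..." comes before status=NNN on the same line, as in the heroku/router entry.

```python
import re
from collections import Counter

# Assumption: router lines look like the sample in the question, with
# path="..." appearing before status=NNN on the same line.
ROUTER_LINE = re.compile(r'path="([^"]*)".*?\bstatus=(\d{3})\b')

def paths_with_status(lines, status="404"):
    """Count how often each path was logged with the given status code."""
    hits = Counter()
    for line in lines:
        match = ROUTER_LINE.search(line)
        # lines without path=/status= (e.g. app/web.2 SQL lines) are skipped
        if match and match.group(2) == status:
            hits[match.group(1)] += 1
    return hits

# Shortened, illustrative lines; the status=200 entry is invented.
sample = [
    'heroku/router at=info method=GET path="/api/accounts/3" host=workabledemo.com status=404 bytes=444',
    'app/web.2 SELECT "accounts".* FROM "accounts" LIMIT 1',
    'heroku/router at=info method=GET path="/api/accounts/3" host=workabledemo.com status=404 bytes=444',
    'heroku/router at=info method=GET path="/" host=workabledemo.com status=200 bytes=512',
]

not_found = paths_with_status(sample)
print(not_found)
```

For the real log you can pass the open file object straight in, e.g. inside a with open(...) as fl: block call paths_with_status(fl), since a file object yields its lines one at a time and 30,000+ entries never have to sit in memory at once.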
Upvotes: 1