zee

Reputation: 23

Python script to read a log file and list the URLs that were not found (404)

From the given log file, I need to find which URLs were not found (404). Sample data from the log file:

Entry 1:

443623565414391809 2014-09-02T14:09:36 2014-09-03T00:48:42Z 4147981 demo-workablehr 54.198.230.235 Local3 Info heroku/router at=info method=GET path="/api/accounts/3" host=workabledemo.com request_id=73ffd4fc-c86c-41ca-a737-91da110fbc39 fwd="50.31.164.139" dyno=web.2 connect=5ms service=17ms status=404 bytes=444

Entry 2:

443623565414391810 2014-09-02T14:10:27 2014-09-03T00:48:42Z 4147981 demo-workablehr 54.198.230.235 Local7 Info app/web.2 [e1af99e5-64b4-4228-8e23-d9b6bab84f80] [VISITOR #NEW] [GUEST] [1m[35mAccount Load (1.2ms)[0m SELECT "accounts".* FROM "accounts" WHERE (accounts.approval_status != 'blocked') AND "accounts"."id" = 3 LIMIT 1

In entry 2, the block character before [1m, [35m, and [0m is ESC (these are ANSI color codes).

I understand that I need to open the file, read its content, and look for status=404. How can I do this in Python 3? The file has 30,000+ entries.

I tried this:

count404 = 0
with open('C:\\Users\\Zee\\Downloads\\testLog.txt','r') as f:
    for line in f:
        for word in line.split():
            count404 += 1
print(count404)

I am wondering whether there is a better approach and, if I keep this one, how to get the list of URLs that have status=404.

I am fairly new to Python and to Stack Overflow. Thanks in advance.

Upvotes: 0

Views: 1282

Answers (1)

Shivam Chawla

Reputation: 458

As pointed out in the comments, regex is your best friend here. Here is a sample approach that counts the 404 entries:

import re

count = 0
# Iterate over the file directly so the whole log is never held in memory
with open('C:\\Users\\Zee\\Downloads\\testLog.txt', 'r') as fl:
    for line in fl:
        if re.search(r'status=404', line):
            count += 1
print(count)

To build a list of all the paths in the log with status 404, we can use regex again:

import re

count = 0
lst = []
with open('C:\\Users\\Zee\\Downloads\\testLog.txt', 'r') as fl:
    for line in fl:
        if re.search(r'status=404', line):
            count += 1
            # Capture the text between the quotes of path="..." on this line
            match = re.search(r'path="([^"]*)"', line)
            if match:
                lst.append(match.group(1))  # only the URL path, without quotes
print(count)
print(lst)
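
Since the question asks for URLs rather than bare paths, the same idea can be extended to join the host and path fields and to collapse repeats. The following is only a sketch: it assumes every router line carries path="..." and host=... fields in the form shown in the question's sample entry, and the function name find_404_urls is made up for illustration.

```python
import re
from collections import Counter

# Patterns for the router fields as they appear in the question's sample line.
# Assumption: other router lines in the log use the same key=value layout.
PATH_RE = re.compile(r'path="([^"]*)"')
HOST_RE = re.compile(r'\bhost=(\S+)')
STATUS_404_RE = re.compile(r'\bstatus=404\b')


def find_404_urls(lines):
    """Return a Counter mapping host+path to its number of 404 responses.

    `lines` can be any iterable of strings, including an open file object,
    so the log is processed one line at a time.
    """
    hits = Counter()
    for line in lines:
        if not STATUS_404_RE.search(line):
            continue  # not a 404 line (for example, an app/web log entry)
        path = PATH_RE.search(line)
        if path is None:
            continue  # status=404 present but no path field; skip the line
        host = HOST_RE.search(line)
        url = (host.group(1) if host else "") + path.group(1)
        hits[url] += 1
    return hits
```

Calling find_404_urls(f) inside a with open(...) as f: block reads the file lazily, and iterating over the result's most_common() lists each distinct URL with the number of times it returned 404.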

Upvotes: 1
