Reputation: 127
I have a wget log file and would like to parse it so that I can extract the relevant info for each log entry, e.g. IP address, timestamp, URL, etc.
A sample log file is printed below. The number of lines and the level of detail are not identical for each entry. What is consistent is the format of each line.
I am able to extract individual lines but I want a multidimensional array (or similar):
import re
f = open('c:/r1/log.txt', 'r').read()
split_log = re.findall('--[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}.*', f)
print(split_log)
print(len(split_log))
for element in split_log:
    print(element)
####### Start log file example
2014-11-22 10:51:31 (96.9 KB/s) - `C:/r1/www.itb.ie/AboutITB/index.html' saved [13302]
--2014-11-22 10:51:31-- http://www.itb.ie/CurrentStudents/index.html
Connecting to www.itb.ie|193.1.36.24|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [text/html]
Saving to: `C:/r1/www.itb.ie/CurrentStudents/index.html'
0K .......... ....... 109K=0.2s
2014-11-22 10:51:31 (109 KB/s) - `C:/r1/www.itb.ie/CurrentStudents/index.html' saved [17429]
--2014-11-22 10:51:32-- h ttp://www.itb.ie/Vacancies/index.html
Connecting to www.itb.ie|193.1.36.24|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [text/html]
Saving to: `C:/r1/www.itb.ie/Vacancies/index.html'
0K .......... .......... .. 118K=0.2s
2014-11-22 10:51:32 (118 KB/s) - `C:/r1/www.itb.ie/Vacancies/index.html' saved [23010]
--2014-11-22 10:51:32-- h ttp://www.itb.ie/Location/howtogetthere.html
Connecting to www.itb.ie|193.1.36.24|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [text/html]
Saving to: `C:/r1/www.itb.ie/Location/howtogetthere.html'
0K .......... ....... 111K=0.2s
Upvotes: 1
Views: 870
Reputation: 55469
Here's how you can extract the data you want and store it in a list of tuples.
The regexes I've used here aren't perfect, but they work ok with your sample data. I modified your original regex to use the more readable \d instead of the equivalent [0-9]. I've also used raw strings, which generally makes working with regexes easier.
I've embedded your log data into my code as a triple-quoted string so I don't have to worry about file handling. I noticed that there are spaces in some of the URLs in your log file, e.g. h ttp://www.itb.ie/Vacancies/index.html, but I assume that those spaces are an artifact of copy & pasting and they don't actually exist in the real log data. If that's not the case, then your program will need to do extra work to cope with such extraneous spaces.
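If the stray spaces really are present in the log data, a minimal sketch of that extra work (assuming the only corruption is extra whitespace inside the URL, as in the sample) could collapse them after matching; clean_url here is a hypothetical helper, not part of the code below:

```python
import re

def clean_url(url):
    # Remove any whitespace inside the URL, e.g. 'h ttp://...' -> 'http://...'.
    return re.sub(r'\s+', '', url)

print(clean_url('h ttp://www.itb.ie/Vacancies/index.html'))
```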
I've also modified the IP addresses in the log data so they aren't all identical, just to make sure that each IP found by findall is correctly associated with the matching timestamp & URL.
#! /usr/bin/env python
import re
log_lines = '''
2014-11-22 10:51:31 (96.9 KB/s) - `C:/r1/www.itb.ie/AboutITB/index.html' saved [13302]
--2014-11-22 10:51:31-- http://www.itb.ie/CurrentStudents/index.html
Connecting to www.itb.ie|193.1.36.24|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [text/html]
Saving to: `C:/r1/www.itb.ie/CurrentStudents/index.html'
0K .......... ....... 109K=0.2s
2014-11-22 10:51:31 (109 KB/s) - `C:/r1/www.itb.ie/CurrentStudents/index.html' saved [17429]
--2014-11-22 10:51:32-- http://www.itb.ie/Vacancies/index.html
Connecting to www.itb.ie|193.1.36.25|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [text/html]
Saving to: `C:/r1/www.itb.ie/Vacancies/index.html'
0K .......... .......... .. 118K=0.2s
2014-11-22 10:51:32 (118 KB/s) - `C:/r1/www.itb.ie/Vacancies/index.html' saved [23010]
--2014-11-22 10:51:32-- http://www.itb.ie/Location/howtogetthere.html
Connecting to www.itb.ie|193.1.36.26|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [text/html]
Saving to: `C:/r1/www.itb.ie/Location/howtogetthere.html'
0K .......... ....... 111K=0.2s
'''
time_and_url_pat = re.compile(r'--(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})--\s+(.*)')
ip_pat = re.compile(r'Connecting to.*\|(.*?)\|')
time_and_url_list = time_and_url_pat.findall(log_lines)
print('\ntime and url\n', time_and_url_list)
ip_list = ip_pat.findall(log_lines)
print('\nip\n', ip_list)
all_data = [(t, u, i) for (t, u), i in zip(time_and_url_list, ip_list)]
print('\nall\n', all_data, '\n')
for t in all_data:
    print(t)
output
time and url
[('2014-11-22 10:51:31', 'http://www.itb.ie/CurrentStudents/index.html'), ('2014-11-22 10:51:32', 'http://www.itb.ie/Vacancies/index.html'), ('2014-11-22 10:51:32', 'http://www.itb.ie/Location/howtogetthere.html')]
ip
['193.1.36.24', '193.1.36.25', '193.1.36.26']
all
[('2014-11-22 10:51:31', 'http://www.itb.ie/CurrentStudents/index.html', '193.1.36.24'), ('2014-11-22 10:51:32', 'http://www.itb.ie/Vacancies/index.html', '193.1.36.25'), ('2014-11-22 10:51:32', 'http://www.itb.ie/Location/howtogetthere.html', '193.1.36.26')]
('2014-11-22 10:51:31', 'http://www.itb.ie/CurrentStudents/index.html', '193.1.36.24')
('2014-11-22 10:51:32', 'http://www.itb.ie/Vacancies/index.html', '193.1.36.25')
('2014-11-22 10:51:32', 'http://www.itb.ie/Location/howtogetthere.html', '193.1.36.26')
The last part of this code uses a list comprehension to reorganize the data from time_and_url_list and ip_list into a single list of tuples, using the zip built-in function to iterate over the two lists in parallel. If that part's a bit hard to follow, please let me know & I'll try to explain it further.
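As a standalone illustration of that pattern (with toy data, not your actual log), zip pairs up the i-th elements of both lists, and the (t, u) on the left of the comprehension unpacks each two-element tuple so a flat three-element tuple can be built:

```python
# Toy data standing in for the parsed log fields.
times_and_urls = [('10:51:31', 'http://a.example'), ('10:51:32', 'http://b.example')]
ips = ['193.1.36.24', '193.1.36.25']

# zip walks both lists in parallel; (t, u) unpacks each (time, url) tuple.
combined = [(t, u, i) for (t, u), i in zip(times_and_urls, ips)]
print(combined)
# [('10:51:31', 'http://a.example', '193.1.36.24'), ('10:51:32', 'http://b.example', '193.1.36.25')]
```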
Upvotes: 1