Reputation: 127
I have a wget log file and would like to parse it so that I can extract the relevant info for each log entry, e.g. IP address, timestamp, URL, etc.
A sample log file is printed below. The number of lines and the level of detail are not identical for each entry. What is consistent is the format of each line.
I am able to extract individual lines but I want a multidimensional array (or similar):
import re
f = open('c:/r1/log.txt', 'r').read()
split_log = re.findall('--[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}.*', f)
print(split_log)
print(len(split_log))
for element in split_log:
    print(element)
####### Start log file example
2014-11-22 10:51:31 (96.9 KB/s) - `C:/r1/www.itb.ie/AboutITB/index.html' saved [13302]
--2014-11-22 10:51:31-- http://www.itb.ie/CurrentStudents/index.html
Connecting to www.itb.ie|193.1.36.24|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [text/html]
Saving to: `C:/r1/www.itb.ie/CurrentStudents/index.html'
0K .......... ....... 109K=0.2s
2014-11-22 10:51:31 (109 KB/s) - `C:/r1/www.itb.ie/CurrentStudents/index.html' saved [17429]
--2014-11-22 10:51:32-- h ttp://www.itb.ie/Vacancies/index.html
Connecting to www.itb.ie|193.1.36.24|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [text/html]
Saving to: `C:/r1/www.itb.ie/Vacancies/index.html'
0K .......... .......... .. 118K=0.2s
2014-11-22 10:51:32 (118 KB/s) - `C:/r1/www.itb.ie/Vacancies/index.html' saved [23010]
--2014-11-22 10:51:32-- h ttp://www.itb.ie/Location/howtogetthere.html
Connecting to www.itb.ie|193.1.36.24|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [text/html]
Saving to: `C:/r1/www.itb.ie/Location/howtogetthere.html'
0K .......... ....... 111K=0.2s
Upvotes: 1
Views: 870
Reputation: 55469
Here's how you can extract the data you want and store it in a list of tuples.
The regexes I've used here aren't perfect, but they work ok with your sample data. I modified your original regex to use the more readable \d instead of the equivalent [0-9]. I've also used raw strings, which generally makes working with regexes easier.
I've embedded your log data into my code as a triple-quoted string so I don't have to worry about file handling. I noticed that there are spaces in some of the URLs in your log file, e.g. h ttp://www.itb.ie/Vacancies/index.html, but I assume that those spaces are an artifact of copy & pasting and they don't actually exist in the real log data. If that's not the case, then your program will need to do extra work to cope with such extraneous spaces.
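If the stray spaces really are present in the log data, a minimal sketch of that extra work (assuming the only corruption is extra whitespace inside the URL, as in the sample) could collapse them after matching; clean_url here is a hypothetical helper, not part of the code below:

```python
import re

def clean_url(url):
    # Remove any whitespace inside the URL, e.g. 'h ttp://...' -> 'http://...'.
    return re.sub(r'\s+', '', url)

print(clean_url('h ttp://www.itb.ie/Vacancies/index.html'))
```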
I've also modified the IP addresses in the log data so they aren't all identical, just to make sure that each IP found by findall is correctly associated with the matching timestamp & URL.
#! /usr/bin/env python
import re
log_lines = '''
2014-11-22 10:51:31 (96.9 KB/s) - `C:/r1/www.itb.ie/AboutITB/index.html' saved [13302]
--2014-11-22 10:51:31-- http://www.itb.ie/CurrentStudents/index.html
Connecting to www.itb.ie|193.1.36.24|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [text/html]
Saving to: `C:/r1/www.itb.ie/CurrentStudents/index.html'
0K .......... ....... 109K=0.2s
2014-11-22 10:51:31 (109 KB/s) - `C:/r1/www.itb.ie/CurrentStudents/index.html' saved [17429]
--2014-11-22 10:51:32-- http://www.itb.ie/Vacancies/index.html
Connecting to www.itb.ie|193.1.36.25|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [text/html]
Saving to: `C:/r1/www.itb.ie/Vacancies/index.html'
0K .......... .......... .. 118K=0.2s
2014-11-22 10:51:32 (118 KB/s) - `C:/r1/www.itb.ie/Vacancies/index.html' saved [23010]
--2014-11-22 10:51:32-- http://www.itb.ie/Location/howtogetthere.html
Connecting to www.itb.ie|193.1.36.26|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: ignored [text/html]
Saving to: `C:/r1/www.itb.ie/Location/howtogetthere.html'
0K .......... ....... 111K=0.2s
'''
time_and_url_pat = re.compile(r'--(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})--\s+(.*)')
ip_pat = re.compile(r'Connecting to.*\|(.*?)\|')
time_and_url_list = time_and_url_pat.findall(log_lines)
print('\ntime and url\n', time_and_url_list)
ip_list = ip_pat.findall(log_lines)
print('\nip\n', ip_list)
all_data = [(t, u, i) for (t, u), i in zip(time_and_url_list, ip_list)]
print('\nall\n', all_data, '\n')
for t in all_data:
    print(t)
output
time and url
[('2014-11-22 10:51:31', 'http://www.itb.ie/CurrentStudents/index.html'), ('2014-11-22 10:51:32', 'http://www.itb.ie/Vacancies/index.html'), ('2014-11-22 10:51:32', 'http://www.itb.ie/Location/howtogetthere.html')]
ip
['193.1.36.24', '193.1.36.25', '193.1.36.26']
all
[('2014-11-22 10:51:31', 'http://www.itb.ie/CurrentStudents/index.html', '193.1.36.24'), ('2014-11-22 10:51:32', 'http://www.itb.ie/Vacancies/index.html', '193.1.36.25'), ('2014-11-22 10:51:32', 'http://www.itb.ie/Location/howtogetthere.html', '193.1.36.26')]
('2014-11-22 10:51:31', 'http://www.itb.ie/CurrentStudents/index.html', '193.1.36.24')
('2014-11-22 10:51:32', 'http://www.itb.ie/Vacancies/index.html', '193.1.36.25')
('2014-11-22 10:51:32', 'http://www.itb.ie/Location/howtogetthere.html', '193.1.36.26')
The last part of this code uses a list comprehension to reorganize the data from time_and_url_list and ip_list into a single list of tuples, using the zip built-in function to iterate over the two lists in parallel. If that part's a bit hard to follow, please let me know & I'll try to explain it further.
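As a standalone illustration of that pattern (with toy data, not your actual log), zip pairs up the i-th elements of both lists, and the (t, u) on the left of the comprehension unpacks each two-element tuple so a flat three-element tuple can be built:

```python
# Toy data standing in for the parsed log fields.
times_and_urls = [('10:51:31', 'http://a.example'), ('10:51:32', 'http://b.example')]
ips = ['193.1.36.24', '193.1.36.25']

# zip walks both lists in parallel; (t, u) unpacks each (time, url) tuple.
combined = [(t, u, i) for (t, u), i in zip(times_and_urls, ips)]
print(combined)
# [('10:51:31', 'http://a.example', '193.1.36.24'), ('10:51:32', 'http://b.example', '193.1.36.25')]
```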
Upvotes: 1