Reputation: 13
My program reads in a large log file. It then searches each line for the IP and the TIME (whatever is in the brackets).
5.63.145.71 - - [30/Jun/2013:08:04:46 -0500] "HEAD / HTTP/1.1" 200 - "-" "checks.panopta.com"
5.63.145.71 - - [30/Jun/2013:08:04:49 -0500] "HEAD / HTTP/1.1" 200 - "-" "checks.panopta.com"
5.63.145.71 - - [30/Jun/2013:08:04:51 -0500] "HEAD / HTTP/1.1" 200 - "-" "checks.panopta.com"
I want to read the whole file, and summarize the entries as follows:
Num 3 IP 5.63.145.71 TIME [30/Jun/2013:08:04:46 -0500]
That is: number of entries, IP, and TIME and DATE.
What I have so far:
import re
x = open("logssss.txt")
dic = {}
for line in x:
    m = re.search(r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b", line).group().split()
    c = re.search(r"\[(.+)\]", line).group().split()
    for i in range(len(m)):
        try:
            dic[m[i]] += 1
        except:
            dic[m[i]] = 1
k = dic.keys()
for i in range(len(k)):
    print dic[k[i]], k[i]
The above code displays correctly now! Thanks.
6 199.21.99.83
1 5.63.145.71
EDIT: How about adding c to my output now? The timestamps are obviously going to differ, but is it possible to show just one of the values on the same line?
Upvotes: 1
Views: 132
Reputation: 39406
You could use a Counter, which is a more concise way to do the counting:
from collections import Counter
cnt = Counter()
for line in x:
    m = re.search(r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b", line).group().split()
    cnt.update(m)
Then do the printing outside the main loop:
for k, v in cnt.iteritems():
    print k, v
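For reference, here is a minimal self-contained sketch of the Counter approach in Python 3 syntax (the code in this thread is Python 2). The sample lines are the three from the question; the name count_ips is made up for illustration, and the sketch skips lines without an IP rather than raising an error, which is an assumption about the desired behaviour.

```python
import re
from collections import Counter

# Same IPv4 pattern as in the question.
IP_RE = re.compile(r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b")

def count_ips(lines):
    """Count lines per IP, using the first IP-like match on each line."""
    cnt = Counter()
    for line in lines:
        match = IP_RE.search(line)
        if match:  # skip lines that contain no IP at all
            cnt[match.group()] += 1
    return cnt

# The three sample lines from the question.
sample = [
    '5.63.145.71 - - [30/Jun/2013:08:04:46 -0500] "HEAD / HTTP/1.1" 200 - "-" "checks.panopta.com"',
    '5.63.145.71 - - [30/Jun/2013:08:04:49 -0500] "HEAD / HTTP/1.1" 200 - "-" "checks.panopta.com"',
    '5.63.145.71 - - [30/Jun/2013:08:04:51 -0500] "HEAD / HTTP/1.1" 200 - "-" "checks.panopta.com"',
]

counts = count_ips(sample)
for ip, n in counts.items():
    print(n, ip)  # prints: 3 5.63.145.71
```

With a real file you would pass the open file object instead of the list, since both are iterables of lines.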
To include c, a defaultdict would be more appropriate:
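from collections import defaultdict
entries = defaultdict(list)  # maps each IP to its list of timestamps
for line in x:
    m = re.search(r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b", line).group().split()[0]
    # group() keeps the full bracketed timestamp, UTC offset included
    c = re.search(r"\[(.+)\]", line).group()
    entries[m].append(c)
for k, v in entries.iteritems():
    print k, len(v), v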
It is my understanding that there is only one IP and date per line, hence the [0] to take the first and only occurrence.
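To show how this grouping can produce the one-line summary the question asks for, here is a hedged Python 3 sketch. The function name group_times_by_ip and the non-greedy bracket pattern are my own choices, not part of the answer above; the sample lines come from the question.

```python
import re
from collections import defaultdict

IP_RE = re.compile(r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b")
# Non-greedy, so the match stops at the first closing bracket.
TIME_RE = re.compile(r"\[(.+?)\]")

def group_times_by_ip(lines):
    """Map each IP to the list of bracketed timestamps seen for it."""
    times = defaultdict(list)
    for line in lines:
        ip = IP_RE.search(line)
        ts = TIME_RE.search(line)
        if ip and ts:  # ignore lines missing either field
            times[ip.group()].append(ts.group())  # group() keeps the brackets
    return times

sample = [
    '5.63.145.71 - - [30/Jun/2013:08:04:46 -0500] "HEAD / HTTP/1.1" 200 - "-" "checks.panopta.com"',
    '5.63.145.71 - - [30/Jun/2013:08:04:49 -0500] "HEAD / HTTP/1.1" 200 - "-" "checks.panopta.com"',
    '5.63.145.71 - - [30/Jun/2013:08:04:51 -0500] "HEAD / HTTP/1.1" 200 - "-" "checks.panopta.com"',
]

grouped = group_times_by_ip(sample)
for ip, stamps in grouped.items():
    # Count, IP, and the first timestamp seen for that IP.
    print("Num", len(stamps), "IP", ip, "TIME", stamps[0])
    # prints: Num 3 IP 5.63.145.71 TIME [30/Jun/2013:08:04:46 -0500]
```

Printing stamps[0] picks the first timestamp per IP; stamps[-1] would give the last one instead.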
Upvotes: 2
Reputation: 44093
Move your print statement outside of the main loop:
import re
x = open("logssss.txt")
dic = {}
for line in x:
    m = re.search(r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b", line).group().split()
    c = re.search(r"\[(.+)\]", line).group().split()
    for i in range(len(m)):
        try:
            dic[m[i]] += 1
        except:
            dic[m[i]] = 1
for k, v in dic.iteritems():  # or items() if Python 3.x
    print k, v
As a tip, you could take advantage of Python's Counter class to replace your try/except block:
from collections import Counter
dic = Counter()
for line in x:
    m = re.search(r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b", line).group().split()
    c = re.search(r"\[(.+)\]", line).group().split()
    for i in range(len(m)):
        dic[m[i]] += 1
for k, v in dic.iteritems():  # or items() if Python 3.x
    print k, v
Based on your comment, I would just use a dictionary of lists; the count for each IP address can then be taken from the length of its list:
dic = {}
for line in x:
    m = re.search(r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b", line).group().split()
    c = re.search(r"\[(.+)\]", line).group().split()
    for i in range(len(m)):
        dic.setdefault(m[i], []).append(c)
for k, v in dic.iteritems():  # or items() if Python 3.x
    print k, len(v), v
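Since the question mentions a large log file and only needs one timestamp per IP, a possible variation (not from the answers here) is to keep just a running count and the first timestamp instead of every timestamp. This is a Python 3 sketch; the name summarize and the [count, first_timestamp] layout are assumptions made for illustration.

```python
import re

IP_RE = re.compile(r"\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b")
TIME_RE = re.compile(r"\[(.+?)\]")

def summarize(lines):
    """Map each IP to [count, first timestamp seen], in one pass."""
    summary = {}
    for line in lines:
        ip = IP_RE.search(line)
        ts = TIME_RE.search(line)
        if not (ip and ts):
            continue  # skip lines missing an IP or a bracketed timestamp
        # setdefault stores the timestamp only the first time an IP appears.
        entry = summary.setdefault(ip.group(), [0, ts.group()])
        entry[0] += 1
    return summary

sample = [
    '5.63.145.71 - - [30/Jun/2013:08:04:46 -0500] "HEAD / HTTP/1.1" 200 - "-" "checks.panopta.com"',
    '5.63.145.71 - - [30/Jun/2013:08:04:49 -0500] "HEAD / HTTP/1.1" 200 - "-" "checks.panopta.com"',
    '5.63.145.71 - - [30/Jun/2013:08:04:51 -0500] "HEAD / HTTP/1.1" 200 - "-" "checks.panopta.com"',
]

result = summarize(sample)
for ip, (count, first_seen) in result.items():
    print("Num", count, "IP", ip, "TIME", first_seen)
```

Memory use then grows with the number of distinct IPs rather than with the number of log lines.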
Upvotes: 3