Reputation: 425
I'm new to Python and have compiled a list of items from a file, where each item holds an element that appeared in the file and its frequency, like this:
('95.108.240.252', 9)
It's mostly IP addresses I'm gathering. I'd like to output the address and frequency like this instead:
IP Frequency
95.108.240.252 9
I'm trying to do this by running a regex on the list item and printing the match, but it raises the following error: TypeError: expected string or bytes-like object
This is the code I'm using to do all of this now:
ips = []  # IP address list
for line in f:
    match = re.search("\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", line)  # Get all IPs line by line
    if match:
        ips.append(match.group())  # if found add to list
from collections import defaultdict
freq = defaultdict(int)
for i in ips:
    freq[i] += 1  # get frequency of IPs
print("IP\t\t Frequency")  # Print header
freqsort = sorted(freq.items(), reverse=True, key=lambda item: item[1])  # sort in descending frequency
for c in range(0, 4):  # print the 4 most frequent IPs
    # print(freqsort[c])  # This line prints the item like ('95.108.240.252', 9)
    m1 = re.search("\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", freqsort[c])  # This is the line returning errors - trying to parse IP on its own from the list
    print(m1.group())  # Then print it
I'm not even trying to parse the frequency yet; I just wanted the IPs as a starting point.
Upvotes: 1
Views: 248
Reputation: 142126
You can use the ipaddress module and collections.Counter from the stdlib to assist with this:
from collections import Counter
from ipaddress import ip_address

with open('somefile.log') as fin:
    ips = Counter()
    for line in fin:
        # Take the text before the first space as the candidate address
        ip, rest_of_line = line.partition(' ')[::2]
        try:
            ips[ip_address(ip)] += 1
        except ValueError:
            pass  # not a valid IP address, skip this line
print(ips.most_common(4))
This handles both IPv4 and IPv6 addresses and makes sure they are actually valid, not just that they "look" correct. Using a collections.Counter also gives you a .most_common() method, which sorts by frequency and limits the result to the n most frequent entries.
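To get the tabular output the question asks for, the (address, count) pairs returned by .most_common() can be unpacked directly. This is only a minimal sketch: it assumes each log line starts with the address followed by a space, and both the sample lines and the helper name count_ips are made up for illustration.

```python
from collections import Counter
from ipaddress import ip_address


def count_ips(lines):
    """Count valid IP addresses at the start of each line (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        candidate = line.partition(' ')[0].strip()
        try:
            counts[ip_address(candidate)] += 1
        except ValueError:
            continue  # first field is not a valid IPv4/IPv6 address
    return counts


# Made-up sample data standing in for the log file.
sample = [
    "95.108.240.252 GET /index.html",
    "10.0.0.1 GET /about",
    "95.108.240.252 GET /contact",
    "not-an-ip GET /",
]

ips = count_ips(sample)
# most_common(4) returns at most 4 (address, count) pairs, most frequent first.
rows = [f"{str(ip)}\t{count}" for ip, count in ips.most_common(4)]
print("IP\tFrequency")
print("\n".join(rows))
```

Because .most_common(4) simply returns fewer pairs when fewer than four distinct addresses exist, no length check is needed before printing.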
Upvotes: 0
Reputation: 11032
The second parameter of re.search() should be a string, but you are passing a tuple. That is why it raises an error saying that it expected a string or bytes-like object.
NOTE: Also make sure that freqsort has at least 4 entries; otherwise freqsort[c] raises an IndexError (list index out of range).
Delete the last two lines and use this instead:
print(freqsort[c][0])
If you want to stick to your format you can use the following, but the regex is redundant because freqsort[c][0] is already the IP string:
m1 = re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", freqsort[c][0])  # freqsort[c][0] is a string, so re.search() accepts it
print(m1.group())
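Since each item in freqsort is an (ip, count) tuple, both values can be unpacked in the loop header, and slicing avoids the index error mentioned in the note. A small sketch, using a made-up ips list in place of the one built from the file:

```python
from collections import defaultdict

# Made-up sample standing in for the `ips` list built in the question.
ips = ["95.108.240.252", "10.0.0.1", "95.108.240.252", "192.168.1.5"]

freq = defaultdict(int)
for ip in ips:
    freq[ip] += 1

freqsort = sorted(freq.items(), key=lambda item: item[1], reverse=True)

lines = ["IP\tFrequency"]
# Slicing never raises, even when fewer than 4 entries exist.
for ip, count in freqsort[:4]:
    lines.append(f"{ip}\t{count}")
print("\n".join(lines))
```

No regex is involved at this stage: the address was already extracted when the list was built, so the tuple's first element can be printed as is.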
Upvotes: 1
Reputation: 8978
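Try a regex with a positive lookbehind and a positive lookahead, applied to the string form of the item, e.g. str(freqsort[c]). The quantifiers are non-greedy so that a multi-digit frequency is captured whole.
(?<=\(')(.*?)(?=').*?(\d+)
The first captured group will be your IP and the second the frequency.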
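A sketch of how a lookaround pattern like this could be applied. The tuple has to be converted with str() first, because re.search() only accepts strings. This version uses non-greedy quantifiers so that a multi-digit count is captured whole; the sample tuple is made up.

```python
import re

item = ('95.108.240.252', 19)  # made-up tuple in the shape shown in the question

# Lookbehind for "('", lazy capture of the IP up to the closing quote,
# then lazily skip ahead to the first run of digits (the frequency).
pattern = re.compile(r"(?<=\(')(.*?)(?=').*?(\d+)")

match = pattern.search(str(item))  # str(item) == "('95.108.240.252', 19)"
if match:
    ip = match.group(1)
    frequency = int(match.group(2))
    print(ip, frequency)
```

This round-trips the tuple through its string representation, so plain unpacking (ip, frequency = item) gives the same result with less work.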
Upvotes: 0
Reputation: 5207
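If line is a bytes object (for example, because the file was opened in binary mode), use a bytes pattern instead; the pattern and the input must be of the same type:
# notice the `b` before the quotes.
match = re.search(rb'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', line)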
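A bytes pattern only works against bytes input, and a str pattern only against str input. The sketch below illustrates both cases with a made-up sample line; note that neither pattern type accepts a tuple.

```python
import re

# Bytes pattern: the `rb` prefix makes it a raw bytes literal.
pattern = re.compile(rb"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

line = b"95.108.240.252 - - GET /index.html"  # made-up bytes line
match = pattern.search(line)
ip = match.group().decode("ascii") if match else None
print(ip)

# Mixing a bytes pattern with a str input raises TypeError.
mixed_error = None
try:
    pattern.search("95.108.240.252")
except TypeError as exc:
    mixed_error = exc
print(type(mixed_error).__name__)
```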
Upvotes: 1