BitFlow

Reputation: 425

regex to find match in element of list

I'm new to Python and have compiled a list of items from a file. Each item holds an element that appeared in the file and its frequency in the file, like this:

('95.108.240.252', 9)

It's mostly IP addresses I'm gathering. I'd like to output the address and frequency like this instead:

IP               Frequency
95.108.240.252   9

I'm trying to do this by running a regex on the list item and printing the match, but when I try, it raises the following error: TypeError: expected string or bytes-like object

This is the code I'm using to do all of this now:

import re

ips = [] # IP address list
for line in f:
    match = re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", line) # Get all IPs line by line
    if match:
        ips.append(match.group()) # if found add to list

from collections import defaultdict
freq = defaultdict(int)
for i in ips:
    freq[i] += 1 # get frequency of IPs

print("IP\t\t  Frequency") # Print header

freqsort = sorted(freq.items(), reverse = True, key=lambda item: item[1]) # sort in descending frequency
for c in range(0,4): # print the 4 most frequent IPs
    # print(freqsort[c])  # This line prints the item like ('95.108.240.252', 9)
    m1 = re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", freqsort[c]) # This is the line returning errors - trying to parse IP on its own from the list
    print(m1.group()) # Then print it

I'm not even trying to parse the frequency yet; I just wanted the IPs as a starting point.

Upvotes: 1

Views: 248

Answers (4)

Jon Clements

Reputation: 142126

You can use the ipaddress module and collections.Counter from the stdlib to assist with this...

from collections import Counter
from ipaddress import ip_address

with open('somefile.log') as fin:
    ips = Counter()
    for line in fin:
        ip, rest_of_line = line.partition(' ')[::2]
        try:
            ips[ip_address(ip)] += 1
        except ValueError:
            pass

print(ips.most_common(4))

This also handles both IPv4 and IPv6 addresses and makes sure they are actually valid, not just that they "look" correct. Using a collections.Counter also gives you a .most_common() method that sorts by frequency and limits the result to the n most frequent entries.
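
To get the two-column output the question asks for, the (address, count) pairs returned by .most_common() can be unpacked and padded with format specifiers. This is only a minimal sketch: the sample lines and the count_ips / format_table helpers are made up for illustration, and it assumes, like the code above, that each line starts with the address followed by a space.

```python
from collections import Counter
from ipaddress import ip_address

# Hypothetical sample data standing in for the lines of the log file.
sample_lines = [
    "95.108.240.252 GET /index.html",
    "95.108.240.252 GET /about.html",
    "10.0.0.1 GET /index.html",
    "not-an-ip GET /index.html",
]

def count_ips(lines):
    """Count the valid addresses found at the start of each line."""
    ips = Counter()
    for line in lines:
        candidate = line.partition(' ')[0]
        try:
            ips[ip_address(candidate)] += 1
        except ValueError:
            pass  # first field is not a valid IPv4/IPv6 address
    return ips

def format_table(ips, n=4):
    """Return a header plus one row for each of the n most common addresses."""
    rows = [f"{'IP':<17}Frequency"]
    for ip, count in ips.most_common(n):
        # str(ip) is needed: the keys are address objects, not strings.
        rows.append(f"{str(ip):<17}{count}")
    return "\n".join(rows)

table = format_table(count_ips(sample_lines))
print(table)
```

The column width of 17 is an arbitrary choice that fits IPv4 addresses; IPv6 addresses can be longer and would need a wider column.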

Upvotes: 0

rock321987

Reputation: 11032

The second parameter of re.search() should be a string, but you are passing a tuple. That is why it raises an error saying that it expected a string or bytes-like object.

NOTE: You also need to make sure that the list holds at least 4 addresses; otherwise freqsort[c] will raise an index-out-of-range error.

Delete the last two lines and use this instead:

print(freqsort[c][0])

If you want to stick to your format you can use the following, but the regex adds nothing, since freqsort[c][0] is already the IP string:

m1 = re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", freqsort[c][0]) # freqsort[c][0] is the IP string, so re.search() accepts it
print(m1.group())
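
Putting the note about the list length and the [0] indexing together, one possible shape for the printing loop is sketched below. The freqsort contents are hypothetical stand-ins for the sorted list in the question, and the column width of 17 is an arbitrary choice. Slicing with [:4] avoids the index error because it simply yields fewer items when the list is short.

```python
# Hypothetical (address, count) pairs, already sorted by descending count,
# standing in for the freqsort list from the question.
freqsort = [("95.108.240.252", 9), ("10.0.0.1", 4), ("192.168.1.7", 1)]

print(f"{'IP':<17}Frequency")  # header, padded to the same column width

rows = []
# Slicing never raises IndexError: with only 3 entries, [:4] yields 3 items.
for ip, count in freqsort[:4]:
    # Each item is a (str, int) tuple, so it can be unpacked directly;
    # no regex is needed to get the address back out.
    row = f"{ip:<17}{count}"
    rows.append(row)
    print(row)
```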

Upvotes: 1

Saleem

Reputation: 8978

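Try a regex with a positive lookbehind and lookahead, applied to the string form of the item, e.g. str(freqsort[c]):

(?<=\(\')(.*)(?=\').*?(\d+)

The first captured group will be your IP and the second the frequency. The lazy .*? before the digits keeps a multi-digit frequency in one piece; a greedy .* would leave only its last digit for the second group.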
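
A quick way to try this out is sketched below. The item value is a hypothetical list entry, and the tuple has to be converted with str() first, since re.search() does not accept tuples. The sketch uses a lazy .*? before the digits so that a multi-digit count is captured whole.

```python
import re

# Pattern meant for the text form of a tuple, e.g. "('95.108.240.252', 12)".
# Group 1: everything between the opening (' and the closing quote.
# Group 2: the digits of the count; the lazy .*? leaves all of them for it.
PATTERN = re.compile(r"(?<=\(\')(.*)(?=\').*?(\d+)")

item = ("95.108.240.252", 12)  # hypothetical list entry

m = PATTERN.search(str(item))  # str(item) gives "('95.108.240.252', 12)"
ip, count = m.group(1), int(m.group(2))
print(ip, count)
```

Since the item is already a tuple, plain unpacking (ip, count = item) gives the same two values without any string conversion.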

Upvotes: 0

RattleyCooper

Reputation: 5207

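Use a bytes pattern instead. This applies when line is a bytes object, for example when the file was opened in binary mode:

# notice the `rb` prefix before the quotes (raw bytes literal).
match = re.search(rb'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', line)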
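
Whether a bytes pattern helps depends on the type of the second argument. The sketch below, using made-up sample lines, shows that the pattern and the searched object must both be str or both be bytes, and that a tuple is rejected either way.

```python
import re

STR_PATTERN = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"
BYTES_PATTERN = rb"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"

text_line = "95.108.240.252 - - GET /"    # hypothetical line read in text mode
bytes_line = b"95.108.240.252 - - GET /"  # the same line read in binary mode

# Matching types work: str pattern on str, bytes pattern on bytes.
str_match = re.search(STR_PATTERN, text_line).group()
bytes_match = re.search(BYTES_PATTERN, bytes_line).group()

# Mixing the two types raises TypeError.
try:
    re.search(STR_PATTERN, bytes_line)
    mixed_ok = True
except TypeError:
    mixed_ok = False

# A tuple is neither str nor bytes, so a bytes pattern does not accept it either.
try:
    re.search(BYTES_PATTERN, ("95.108.240.252", 9))
    tuple_ok = True
except TypeError:
    tuple_ok = False

print(str_match, bytes_match, mixed_ok, tuple_ok)
```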

Upvotes: 1
