set() not removing duplicates

Question

I'm trying to find the unique IP addresses in a file using a regex. The matches are found fine; I append each one to a list and later call set() on that list to remove duplicates. There are definitely duplicates, but the list never gets any shorter: printing the set gives the same output as printing ips as a list, and nothing is removed.

ips = [] # make a list
count = 0
count1 = 0
for line in f: #loop through file line by line
    match = re.search("\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", line) #find IPs
    if match: #if there's a match append and keep track of the total number of Ips
        ips.append(match) #append to list
        count = count + 1
ipset = set(ips)
print(ipset, count)

The string <_sre.SRE_Match object; span=(0, 13), match='137.43.92.119'> shows up 60+ times in the output, both before and after calling set() on the list.

Martijn Pieters · Accepted Answer

You are not storing the matched strings. You are storing the re.Match objects. These don't compare equal even if they matched the same text, so they are all seen as unique by a set object:

>>> import re
>>> line = '137.43.92.119\n'
>>> match1 = re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", line)
>>> match1
<_sre.SRE_Match object; span=(0, 13), match='137.43.92.119'>
>>> match2 = re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", line)
>>> match2
<_sre.SRE_Match object; span=(0, 13), match='137.43.92.119'>
>>> match1 == match2
False
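
The reason is that match objects do not define their own equality, so they fall back to the default identity comparison: a match object is equal only to itself. A small sketch of what that means for a set (the names pattern, m1 and m2 are just illustrative):

```python
import re

# Illustrative names; the pattern is the same one used in the question.
pattern = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")
line = "137.43.92.119\n"

m1 = pattern.search(line)
m2 = pattern.search(line)

# Two separate searches produce two distinct objects, even for the same text.
same_object = {m1, m1}        # the same object added twice collapses to one entry
distinct_objects = {m1, m2}   # two different objects stay as two entries

print(len(same_object))
print(len(distinct_objects))
```

So the set is working as intended; it simply has nothing it considers a duplicate.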

Extract the matched text instead:

ips.append(match.group()) #append to list

matchobj.group() without arguments returns the part of the string that was matched (group 0):

>>> match1.group()
'137.43.92.119'
>>> match1.group() == match2.group()
True
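
Putting it together, here is a minimal sketch of the loop with that fix applied. The function name unique_ips and the sample lines are made up for illustration; in the original code the lines would come from the open file f. It also adds to a set directly instead of building a list first, which is a small design choice rather than a requirement:

```python
import re

# Same pattern as in the question, compiled once and written as a raw string.
IP_PATTERN = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

def unique_ips(lines):
    """Return (set of matched IP strings, number of lines that had a match).

    Hypothetical helper: 'lines' can be any iterable of strings,
    such as an open file object.
    """
    ips = set()
    count = 0
    for line in lines:
        match = IP_PATTERN.search(line)  # first match on the line, as in the question
        if match:
            ips.add(match.group())       # store the matched text, not the match object
            count += 1
    return ips, count

# Made-up sample data standing in for the file contents.
sample = [
    "137.43.92.119 GET /\n",
    "137.43.92.119 POST /\n",
    "10.0.0.1 GET /\n",
    "no address here\n",
]
ipset, count = unique_ips(sample)
print(sorted(ipset), count)
```

Like the original, this only picks up the first address on each line; re.findall() or re.finditer() would be needed if a line can contain several.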
