Reputation: 534
I have a long list (100k+) of IP addresses in a certain range. A sample from the list looks like this:
67.0.105.76 0
67.0.123.150 0
67.0.123.150 0
67.0.123.150 0
67.0.123.150 0
67.0.123.150 0
67.0.123.150 0
67.0.123.150 0
67.0.123.150 0
67.0.105.76 0
67.0.123.150 0
67.0.163.127 0
67.0.123.150 0
67.0.163.127 0
67.0.163.127 0
67.0.163.127 0
67.0.163.127 0
67.0.163.127 0
67.0.163.127 0
67.0.163.127 0
67.0.163.127 0
67.0.163.127 0
67.0.105.76 0
67.0.105.76 0
67.0.105.76 0
67.0.232.158 0
67.0.232.158 0
67.0.232.158 0
67.0.232.158 0
67.0.232.158 0
67.0.232.158 0
67.0.232.158 0
67.0.232.158 0
67.0.232.158 0
67.0.232.158 0
67.0.105.76 0
67.0.143.13 0
From this list I want to remove any IPs that are not listed multiple times; say, for example, remove all IPs from the list above that aren't listed 5 or more times. It would then output:
67.0.105.76 0
67.0.123.150 0
67.0.163.127 0
67.0.232.158 0
I've tried to accomplish this using sed/uniq on Linux but wasn't able to find a way to do it. Would a Python script or similar be needed, or is there a way to do this with sed/uniq?
Using sort -u 100kfile removed all the duplicates, but it still left the single IPs in the list.
Upvotes: 1
Views: 183
Reputation: 41446
Here is a simple way to do it in awk:
awk '{a[$0]++} END {for (i in a) if (a[i]>4) print i}' file
67.0.232.158 0
67.0.105.76 0
67.0.163.127 0
67.0.123.150 0
Count every occurrence of each unique IP and store the number in the array a. If an IP has more than 4 hits, print it.
It should be faster than the sort | uniq | awk pipeline, since it counts in a single pass and never has to sort the file.
PS: I saw after posting this that it's the same as what jaypal posted in a comment.
Upvotes: 0
Reputation: 6098
Pure Python solution, using the Counter tool from the collections module.
I have no idea how this will do with 100k addresses, but you could give it a go.
from collections import Counter

with open('ip_file.txt', 'r') as f:
    ip_list = map(lambda x: x.strip(), f.readlines())

ip_by_count = Counter(ip_list)

for ip in ip_by_count:
    if ip_by_count[ip] > 1:  # change to > 4 to keep only IPs seen 5 or more times
        print ip
Or an alternative approach: maintain two sets, one of IPs seen exactly once, and one for IPs seen at least twice. Print an IP when we see it for a second time, and skip all subsequent appearances:
known_dupes = set()
single_ips = set()

with open('ip_file.txt', 'r') as f:
    ip_list = map(lambda x: x.strip(), f.readlines())

for ip in ip_list:
    if ip in known_dupes:
        continue
    elif ip in single_ips:
        print ip
        known_dupes.add(ip)
        single_ips.remove(ip)
    else:
        single_ips.add(ip)
I suspect the first is probably faster, but I haven’t tried it on a large file to check.
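If you do want to check, a rough way is to wrap each approach in a function and time it with timeit. This is only a minimal sketch, written for Python 3 (unlike the snippets above); the function names are just for illustration, and ip_file.txt matches the file name used above.

import timeit
from collections import Counter

def count_with_counter(path='ip_file.txt'):
    # Build a frequency table of whole lines, then keep the repeated ones
    with open(path) as f:
        counts = Counter(line.strip() for line in f)
    return [ip for ip, n in counts.items() if n > 1]

def count_with_sets(path='ip_file.txt'):
    # Track lines seen exactly once vs. seen at least twice
    known_dupes, single_ips = set(), set()
    with open(path) as f:
        for line in f:
            ip = line.strip()
            if ip in known_dupes:
                continue
            elif ip in single_ips:
                known_dupes.add(ip)
                single_ips.remove(ip)
            else:
                single_ips.add(ip)
    return list(known_dupes)

# Run each approach a few times and report the best wall-clock time
for fn in (count_with_counter, count_with_sets):
    print(fn.__name__, min(timeit.repeat(fn, number=1, repeat=3)))

Taking the min() of the repeats is the usual way to read a timeit result, since the fastest run is the one least disturbed by whatever else the machine is doing.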
Upvotes: 0
Reputation: 18917
Using sort, uniq and awk: uniq -c prefixes each distinct line with the number of times it occurs, and awk prints only the lines whose count is greater than 4, dropping the count column.
pu@pumbair: ~ sort data.txt | uniq -c | awk '{if ($1 > 4) print $2,$3}'
67.0.105.76 0
67.0.123.150 0
67.0.163.127 0
67.0.232.158 0
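For reference, the intermediate output of sort data.txt | uniq -c on the sample from the question looks something like this (counts taken from that sample; the exact column padding depends on your uniq):

      6 67.0.105.76 0
     10 67.0.123.150 0
      1 67.0.143.13 0
     10 67.0.163.127 0
     10 67.0.232.158 0

awk then drops the 67.0.143.13 line (count 1) and strips the leading counts, giving the output shown above.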
Upvotes: 4