Reputation: 9610
I am trying to build a service responsible for geo location of IP addresses. The open database of IP addresses is a CSV file of the following format: starting_ip, ending_ip, region
So I was thinking about converting IPs to integers, and trying to see if a given integer is within ranges of starting and ending... However at this point I don't quite see how this comparison can be performed in an efficient way, taking into account the size of 500K entries.
At first I was trying to load everything into memory using the following dict:
{(ip_start, ip_end): 'region', ....}
But at this point I don't see how to find a key in this dict by IP address.
Upvotes: 1
Views: 1272
Reputation: 103
The netaddr
package has an IPSet
class that should help with this in a reasonably optimal way.
from netaddr import IPRange, IPSet
# 1. Your CSV data
rows = [
('14.2.0.0', '14.2.255.255', 'bar'),
('24.2.1.0', '24.2.1.120', 'baz'),
('30.2.0.0', '30.2.3.3', 'bar'),
('30.3.0.0', '30.3.127.255', 'bar'),
# ...
]
# 2. Create region -> IPSet mapping. This may be relatively slow.
regions = {}
for start_ip, stop_ip, region in rows:
regions.setdefault(region, IPSet())
regions[region].add(IPRange(start_ip, stop_ip))
# 3. Find a matching region for a given address. This will be fast.
def find_region(address):
for region, ip_set in regions.items():
if address in ip_set:
return region
return None
# 4. Demonstration
print(find_region('14.2.50.50')) # Prints bar
print(find_region('8.8.8.8')) # Prints None
Upvotes: 0
Reputation: 180502
I would suggest you persist the data once sorted in whatever usable format you like but using a sortedtcontainers SortedDict will allow you to do the comparison in log n time once you have a sorted collection, sorted by the start ip:
import csv
from sortedcontainers import sorteddict
with open("ips.csv") as f:
ips = ["192.168.43.102", "10.10.145.100", "192.168.1.1", "192.168.43.99","127.0.0.1"]
reader = csv.reader(f)
# Use start ip as the key, creating tuple or using netaddr to turn into an int
sorted_dict = sorteddict.SortedDict((tuple(map(int, sip.split("."))),(eip, rnge)) for sip, eip, rnge in reader)
for ip in ips:
# do the same for the ip you want to search for
ip = tuple(map(int, ip.split(".")))
# bisect to see where the ip would land
ind = sorted_dict.bisect(ip) - 1
start_ip = sorted_dict.iloc[ind]
end_ip = tuple(map(int, sorted_dict[sorted_dict.iloc[ind]][0].split(".")))
print(start_ip, ip, end_ip)
print(start_ip <= ip <= end_ip)
If we run the code on a test file:
In [5]: !cat ips.csv
192.168.43.100,192.168.43.130,foo
192.168.27.1,192.168.27.12,foobar
192.168.1.1,192.168.1.98,bar
192.168.43.131,192.168.43.140,bar
10.10.131.10,10.10.131.15,barf
10.10.145.10,10.10.145.100,foob
In [6]: import csv
In [7]: from sortedcontainers import sorteddict
In [8]: with open("ips.csv") as f:
...: ips = ["192.168.43.102", "10.10.145.100", "192.168.1.1", "192.168.43.99","127.0.0.1"]
...: reader = csv.reader(f)
...: sorted_dict = sorteddict.SortedDict((tuple(map(int, sip.split("."))),(eip, rnge)) for sip, eip, rnge in reader)
...: for ip in ips:
...: ip = tuple(map(int, ip.split(".")))
...: ind = sorted_dict.bisect(ip) - 1
...: start_ip = sorted_dict.iloc[ind]
...: end_ip = tuple(map(int, sorted_dict[sorted_dict.iloc[ind]][0].split(".")))
...: print(start_ip,ip, end_ip)
...: print(start_ip <= ip <= end_ip)
...:
(192, 168, 43, 100) (192, 168, 43, 102) (192, 168, 43, 130)
True
(10, 10, 145, 10) (10, 10, 145, 100) (10, 10, 145, 100)
True
(192, 168, 1, 1) (192, 168, 1, 1) (192, 168, 1, 98)
True
(192, 168, 27, 1) (192, 168, 43, 99) (192, 168, 27, 12)
False
(10, 10, 145, 10) (127, 0, 0, 1) (10, 10, 145, 100)
False
You could also modify bisect_right to only consider the first elements and use a regular python list:
def bisect_right(a, x, lo=0, hi=None):
if lo < 0:
raise ValueError('lo must be non-negative')
if hi is None:
hi = len(a)
while lo < hi:
mid = (lo+hi) // 2
if x < a[mid][0]:
hi = mid
else:
lo = mid + 1
return lo
with open("ips.csv") as f:
ips = ["192.168.43.102", "10.10.145.100", "192.168.1.1", "192.168.43.99", "127.0.0.1"]
reader = csv.reader(f)
sorted_data = sorted(((tuple(map(int, sip.split("."))), eip, rnge) for sip, eip, rnge in reader))
for ip in ips:
ip = tuple(map(int, ip.split(".")))
ind = bisect_right(sorted_data, ip) - 1
ip_sub = sorted_data[ind]
start_ip, end_ip, _ = sorted_data[ind]
end_ip = tuple(map(int, end_ip.split(".")))
print(start_ip, ip, end_ip)
print(start_ip <= ip <= end_ip)
The result will be the same, I would imagine using a SortedDict is almost certainly faster as the bisect is done at the c level.
Upvotes: 0
Reputation: 500873
Assuming the ranges are non-overlapping, you could sort them once by ip_start
and then use binary search to find a candidate range. Once you've found a candidate range, all you have to do is check whether the IP address falls between ip_start
and ip_end
.
You could use the built-in bisect
module to perform the binary search.
This gives O(logn)
lookup cost.
Upvotes: 2