Oleg Tarasenko
Oleg Tarasenko

Reputation: 9610

Find if IP address is in range of addresses

I am trying to build a service responsible for geo location of IP addresses. The open database of IP addresses is a CSV file of the following format: starting_ip, ending_ip, region

So I was thinking about converting IPs to integers, and trying to see if a given integer is within ranges of starting and ending... However at this point I don't quite see how this comparison can be performed in an efficient way, taking into account the size of 500K entries.

At first I was trying to load everything into memory using the following dict:

{(ip_start, ip_end): 'region', ....}

But at this point I don't see how to find a key in this dict by IP address.

Upvotes: 1

Views: 1272

Answers (3)

j_s
j_s

Reputation: 103

The netaddr package has an IPSet class that should help with this in a reasonably optimal way.

from netaddr import IPRange, IPSet
# 1. Your CSV data
rows = [
    ('14.2.0.0', '14.2.255.255', 'bar'),
    ('24.2.1.0', '24.2.1.120', 'baz'),
    ('30.2.0.0', '30.2.3.3', 'bar'),
    ('30.3.0.0', '30.3.127.255', 'bar'),
    # ...
]

# 2. Create region -> IPSet mapping. This may be relatively slow.
regions = {}
for start_ip, stop_ip, region in rows:
    regions.setdefault(region, IPSet())
    regions[region].add(IPRange(start_ip, stop_ip))

# 3. Find a matching region for a given address. This will be fast.
def find_region(address):
    for region, ip_set in regions.items():
        if address in ip_set:
            return region
    return None

# 4. Demonstration
print(find_region('14.2.50.50'))  # Prints bar
print(find_region('8.8.8.8'))  # Prints None

Upvotes: 0

Padraic Cunningham
Padraic Cunningham

Reputation: 180502

I would suggest you persist the data once sorted in whatever usable format you like but using a sortedtcontainers SortedDict will allow you to do the comparison in log n time once you have a sorted collection, sorted by the start ip:

import csv
from sortedcontainers import sorteddict

with open("ips.csv") as f:
    ips = ["192.168.43.102", "10.10.145.100", "192.168.1.1", "192.168.43.99","127.0.0.1"]
    reader = csv.reader(f)
     # Use start ip as the key, creating tuple or using netaddr to turn into an int
    sorted_dict = sorteddict.SortedDict((tuple(map(int, sip.split("."))),(eip, rnge)) for sip, eip, rnge in reader)

    for ip in ips:
        # do the same for the ip you want to search for
        ip = tuple(map(int, ip.split(".")))
         # bisect to see where the ip would land
        ind = sorted_dict.bisect(ip) - 1
        start_ip = sorted_dict.iloc[ind]
        end_ip = tuple(map(int, sorted_dict[sorted_dict.iloc[ind]][0].split(".")))
        print(start_ip, ip, end_ip)
        print(start_ip <= ip <= end_ip)

If we run the code on a test file:

In [5]: !cat ips.csv
192.168.43.100,192.168.43.130,foo
192.168.27.1,192.168.27.12,foobar
192.168.1.1,192.168.1.98,bar
192.168.43.131,192.168.43.140,bar
10.10.131.10,10.10.131.15,barf
10.10.145.10,10.10.145.100,foob

In [6]: import csv

In [7]: from sortedcontainers import sorteddict

In [8]: with open("ips.csv") as f:
   ...:         ips = ["192.168.43.102", "10.10.145.100", "192.168.1.1", "192.168.43.99","127.0.0.1"]
   ...:         reader = csv.reader(f)
   ...:         sorted_dict = sorteddict.SortedDict((tuple(map(int, sip.split("."))),(eip, rnge)) for sip, eip, rnge in reader)
   ...:         for ip in ips:
   ...:                 ip = tuple(map(int, ip.split(".")))
   ...:                 ind = sorted_dict.bisect(ip) - 1
   ...:                 start_ip = sorted_dict.iloc[ind]
   ...:                 end_ip = tuple(map(int, sorted_dict[sorted_dict.iloc[ind]][0].split(".")))
   ...:                 print(start_ip,ip, end_ip)
   ...:                 print(start_ip <= ip <= end_ip)
   ...:         
(192, 168, 43, 100) (192, 168, 43, 102) (192, 168, 43, 130)
True
(10, 10, 145, 10) (10, 10, 145, 100) (10, 10, 145, 100)
True
(192, 168, 1, 1) (192, 168, 1, 1) (192, 168, 1, 98)
True
(192, 168, 27, 1) (192, 168, 43, 99) (192, 168, 27, 12)
False
(10, 10, 145, 10) (127, 0, 0, 1) (10, 10, 145, 100)
False

You could also modify bisect_right to only consider the first elements and use a regular python list:

def bisect_right(a, x, lo=0, hi=None):
    if lo < 0:
        raise ValueError('lo must be non-negative')
    if hi is None:
        hi = len(a)
    while lo < hi:
        mid = (lo+hi) // 2
        if x < a[mid][0]:
            hi = mid
        else:
            lo = mid + 1
    return lo

with open("ips.csv") as f:
    ips = ["192.168.43.102", "10.10.145.100", "192.168.1.1", "192.168.43.99", "127.0.0.1"]
    reader = csv.reader(f)
    sorted_data = sorted(((tuple(map(int, sip.split("."))), eip, rnge) for sip, eip, rnge in reader))
    for ip in ips:
        ip = tuple(map(int, ip.split(".")))
        ind = bisect_right(sorted_data, ip) - 1

        ip_sub = sorted_data[ind]
        start_ip, end_ip, _ = sorted_data[ind]
        end_ip = tuple(map(int, end_ip.split(".")))
        print(start_ip, ip, end_ip)
        print(start_ip <= ip <= end_ip)

The result will be the same, I would imagine using a SortedDict is almost certainly faster as the bisect is done at the c level.

Upvotes: 0

NPE
NPE

Reputation: 500873

Assuming the ranges are non-overlapping, you could sort them once by ip_start and then use binary search to find a candidate range. Once you've found a candidate range, all you have to do is check whether the IP address falls between ip_start and ip_end.

You could use the built-in bisect module to perform the binary search.

This gives O(logn) lookup cost.

Upvotes: 2

Related Questions