Joe legrae
Joe legrae

Reputation: 147

Given a flat file of IP Ranges and mappings, find a city given an IP

This is the question:

Given a flat text file that contains a range of IP addresses that map to a location (e.g. 192.168.0.0-192.168.0.255 = Boston, MA), come up with an algorithm that will find a city for a specific ip address if a mapping exists.

My only idea is parse the file, and turn the IP ranges into just ints (multiplying by 10/100 if it's missing digits) and place them in a list, while also putting the lower of the ranges into a hash as the key with the location as a value. Sort the list and perform a slightly modified binary search. If the index is odd, -1 and look in the hash. If it's even, just look in the hash.

Any faults in my plans, or better solutions?

Upvotes: 12

Views: 2896

Answers (4)

Tony Delroy
Tony Delroy

Reputation: 106126

My only idea is parse the file, and turn the IP ranges into just ints (multiplying by 10/100 if it's missing digits)...

If following this approach, you would probably want to multiply by 256^3, 256^2, 256 and 1 respectively for A, B, C and D in an address A.B.C.D. That effectively recreates the IP address as a 32-bit unsigned number.

... and place them in a list, while also putting the lower of the ranges into a hash as the key with the location as a value. Sort the list and perform a slightly modified binary search. If the index is odd, -1 and look in the hash. If it's even, just look in the hash.

I would suggest creating a contiguous array (a std::vector) containing structs with the lower and upper ranges (and location name - discussed below). Then as you say you can binary search for a range including a specific value, without any odd/even hassles.

Using the lower end of the range as a key in a hash is one way to avoid having space for the location names in the array, but given the average number of characters in a city name, the likely size of pointers, a choice between a sparsely populated hash table and lengthly displacement lists to search in successive alternative buckets or further indirection to arbitrary length containers - you'd need to be pretty desperate to bother trying. In the first instance, storing the location in struct alongside the IP value range seems good.

Alternatively, you could create a tree based on e.g. the individual 0-255 IP values: each level in the tree could be either an array of 256 values for direct indexing, or a sorted array of populated values. That can reduce the number of IP value comparisons you're likely to need to make (O(log2N) to O(1)).

Upvotes: 1

reddragon
reddragon

Reputation: 971

The problem smells of ranges, and one of the good data-structures for this problem would be a Segment Tree. Some resources to help you get started.

The root of the segment tree can represent the addresses (0.0.0.0 - 255.255.255.255). The left sub-tree would represent the addresses (0.0.0.0 - 127.255.255.255) and the right sub-tree would represent the range (128.0.0.0 - 255.255.255.255), and so on. This will go on till we reach ranges which cannot be further sub-divided. Say, if we have the range 32.0.0.0 - 63.255.255.255, mapped to some arbitrary city, it will be a leaf node, we will not further subdivide that range when we arrive there, and tag it to the specific city.

To search for a specific mapping, we follow the tree, just as we do in a Binary Search Tree. If your IP lies in the range of the left sub-tree, move to the left sub-tree, else move to the right sub-tree.

The good parts:

  1. You need not have all sub-trees, only add the sub-trees which are required. For example, if in your data, there is no city mapped for the range (0.0.0.0 - 127.255.255.255), we will not construct that sub-tree.
  2. We are space efficient. If the entire range is mapped to one city, we will create only the root node!
  3. This is a dynamic data-structure. You can add more cities, split-up ranges later on, etc.
  4. You will be making constant number of operations, since the maximum depth of the tree would be 4 x log2(256) = 32. For this particular problem it turns out that Segment Trees would be as fast as van-Emde Boas trees, and require lesser space (O(N)).
  5. This is a simple, but non-trivial data-structure, which is better than sorting, because it is dynamic, and easier to explain to your interviewer than van-Emde Boas trees.
  6. This is one of the easiest non-trivial data-structures to code :)

Please note that in some Segment Tree tutorials, they use arrays to represent the tree. This is probably not what you want, since we would not be populating the entire tree, so dynamically allocating nodes, just like we do in a standard Binary Tree is the best.

Upvotes: 6

templatetypedef
templatetypedef

Reputation: 372814

Your approach seems perfectly reasonable.

If you are interested in doing a bit of research / extra coding, there are algorithms that will asymptotically outperform the standard binary search technique that rely on the fact that your IP addresses can be interpreted as integers in the range from 0 to 231 - 1. For example, the van Emde Boas tree and y-Fast Trie data structures can implement the predecessor search operation that you're looking at in time O(log log U), where U is the maximum possible IP address, as opposed to the O(log N) approach that binary search uses. The constant factors are higher, though, which means that there is no guarantee that this approach will be any faster. However, it might be worth exploring as another approach that could potentially be even faster.

Hope this helps!

Upvotes: 5

Chetter Hummin
Chetter Hummin

Reputation: 6817

In your example, 192.168.0.0-192.168.0.255 = Boston, MA.

Will the first three octets (192.168.0) be the same for both IP addresses in the entry? Also, will the first three octets be unique for a city?

If so, then this problem can solved more easily

Upvotes: 0

Related Questions