Rock

Reputation: 2977

Batch geo locate millions of IPs

I have 2 million IP addresses and 25 million IP ranges (Start IP, End IP, Geo-Location) stored in PostgreSQL. Is there an efficient way to look up the geo-locations of those 2 million IPs against the 25 million ranges? What I did was to check, for each IP address, whether it falls between a Start IP and an End IP and then look up the corresponding location, but this seems to take forever. Essentially, this is like looking up a set of integers against a set of ranges, such as searching for {7, 13, 31, 42} in:

Start End Loc
1     10  US
11    20  US
21    26  CN
29    32  SE
33    45  CA

and returning:

7  US
13 US
31 SE
42 CA
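In SQL terms, what I've been running is essentially a range join along these lines (table and column names here are just illustrative):

 -- "ips" holds the addresses to look up, "geoloc" the ranges with locations
 SELECT i.ip, g.loc
   FROM ips i
   JOIN geoloc g ON i.ip BETWEEN g.start_ip AND g.end_ip;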

Please note that the ranges are not necessarily contiguous and may differ in size. Thank you!

EDIT

As a concrete example, here's the data I'm dealing with:

     start_ip     |      end_ip      | country |  region   |   city    | 
------------------+------------------+---------+-----------+-----------+-
 1.33.254.73/32   | 1.33.254.73/32   | jpn     | 33        | kurashiki | 
 1.39.1.0/32      | 1.39.4.255/32    | ind     | mh        | mumbai    | 
 1.40.144.0/32    | 1.40.145.255/32  | aus     | ns        | fairfield | 
 1.40.235.0/32    | 1.40.242.255/32  | aus     | ns        | sydney    | 
 1.44.28.0/32     | 1.44.29.255/32   | aus     | vi        | melbourne | 
 1.44.82.0/32     | 1.44.83.255/32   | aus     | vi        | melbourne | 
 1.44.92.0/32     | 1.44.93.255/32   | aus     | vi        | melbourne | 
 1.44.128.0/32    | 1.44.129.255/32  | aus     | vi        | melbourne | 
 1.44.220.0/32    | 1.44.221.255/32  | aus     | vi        | melbourne | 
 ......
 ......

And the queries are something like:

 75.149.219.61/32
 68.239.61.29/32
 96.41.50.165/32
 183.62.126.7/32
 ......

Upvotes: 2

Views: 1176

Answers (3)

eevar

Reputation: 3629

You'll have to provide your query and query plan for meaningful input on this. E.g.:

explain select hits.ip, locations.loc
 from hits left outer join locations
   on (hits.ip >= locations.start and hits.ip <= locations.stop);
                                  QUERY PLAN                                   
-------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.00..245.06 rows=2400 width=36)
   Join Filter: ((hits.ip >= locations.start) AND (hits.ip <= locations.stop))
   ->  Seq Scan on hits  (cost=0.00..34.00 rows=2400 width=4)
   ->  Materialize  (cost=0.00..1.07 rows=5 width=40)
         ->  Seq Scan on locations  (cost=0.00..1.05 rows=5 width=40)
(5 rows)

I'm not sure you want to add location data to your index, as one of the other answers suggests. That's just dead data adding bloat; it's no good for looking up rows.

Even if you use a pg version that supports index-only scans (9.2, which is still in beta), a smaller, leaner index will probably still give faster results, despite the one additional tuple lookup per row.
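To sketch what I mean, with the locations table from the plan above, keep the index to just the range bounds (the index name is arbitrary):

 -- lean index: only the columns used to find the row, no location payload
 CREATE INDEX locations_range_idx ON locations (start, stop);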

Upvotes: 1

LSerni

Reputation: 57408

The best and most elegant solution would be, I think, to store the IPs and ranges in the inet/cidr format. IP ranges are usually published in network/mask format anyway, not as Start/End. This allows writing a JOIN based on

ON (ip.addr << geoloc.range)

Of course, the ip table should be indexed by addr and geoloc by (range, location); if you do not have the CIDR format and need to build it from Start/End, that could be expensive (but the table would be easier to use afterwards).

See

http://www.postgresql.org/docs/9.0/static/functions-net.html
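As a rough sketch (assuming an ip table with an inet addr column and a geoloc table with a cidr range column; names are illustrative):

 SELECT ip.addr, geoloc.loc
   FROM ip
   JOIN geoloc ON ip.addr << geoloc.range;

plus the indexes on addr and (range, location) mentioned above.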

EDIT: unfortunately, those start/end values look like "optimized" CIDR ranges. In other words, e.g.

1.40.235.0     1.40.242.255

is actually the merging of four separate contiguous ranges:

 11101011                235.0 - 235.255   (1.40.235.0/24)
 11101100 .. 11101111    236.0 - 239.255   (1.40.236.0/22)
 11110000 .. 11110001    240.0 - 241.255   (1.40.240.0/23)
 11110010                242.0 - 242.255   (1.40.242.0/24)

so it's not practical to explode the row into the four rows necessary for CIDR operation.

The Start/End values appear to be in the cidr datatype, so converting them to inet (they're all /32 anyway...), keeping the queried values in the inet datatype as well, and indexing on (Start, End) ought to give reasonable results:

 SELECT query.ip, geoloc.country, geoloc.region, geoloc.city
     FROM query JOIN geoloc
     ON (query.ip >= geoloc.start_ip AND query.ip <= geoloc.end_ip);
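For example, the conversion and index could look like this (assuming the columns are currently cidr; the index name is arbitrary):

 ALTER TABLE geoloc
     ALTER COLUMN start_ip TYPE inet USING start_ip::inet,
     ALTER COLUMN end_ip   TYPE inet USING end_ip::inet;

 CREATE INDEX geoloc_range_idx ON geoloc (start_ip, end_ip);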

Another alternative, NOT very elegant (actually a hack), would be to "explode" both the ip and geoloc tables into separate sub-tables, based on e.g. the first byte of addr and range (I don't expect you to have an IP range whose start and end differ in the first byte).

 SELECT * INTO TABLE geoloc_5 FROM geoloc
     WHERE start_ip >= inet '5.0.0.0' AND end_ip <= inet '5.255.255.255';

 SELECT * INTO TABLE query_5 FROM query
     WHERE ip >= inet '5.0.0.0' AND ip <= inet '5.255.255.255';

 -- Remember to index each sub-table:
 CREATE INDEX ON geoloc_5 (start_ip, end_ip);
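Each bucket can then be joined on its own, for instance:

 SELECT q.ip, g.country, g.region, g.city
     FROM query_5 q
     JOIN geoloc_5 g ON q.ip >= g.start_ip AND q.ip <= g.end_ip;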

This approach did work several years ago for a big PostgreSQL batch, but I expect that, since then, a cleverer index manager (together with the dedicated datatypes) has evolved to more than match this DIY partitioning. So the naive manual partitioning should only be used as a very last resort, if the << CIDR operator can't be used.

That said, suppose that both tables have a flat distribution (just to get a ballpark figure).

Then, instead of one SELECT over 2M x 25M records, you run 256 SELECTs of 2M/256 by 25M/256 each. So instead of 2M x 25M = 50T comparisons, you have 256 x (2M/256) x (25M/256) ≈ 195G comparisons, which ought to be roughly 256x faster than a straight JOIN.

But I repeat, I expect that PostgreSQL, seeing a properly indexed CIDR field, will no longer really perform a "straight" JOIN, but employ this trick (and then some).

Upvotes: 2

acconrad

Reputation: 3219

If you're querying the Loc column, you should add an index to it. Also, since this is a three-column table, it might be wise to combine StartIP and EndIP, use that as the key and the geolocation as the value, and read it all from a key-value store such as Redis or Memcached. NoSQL/tableless data stores are designed for this sort of thing, where you are reading against millions of data points.

EDIT: after reading some comments, it occurred to me that another solution would be to parallelize your search, MapReduce-style. Assign threads to query a range of IPs (e.g. thread 1: 1-10, thread 2: 11-20, etc.) in the Map step, then have threads combine the fragmented results into one in the Reduce step. You'd obviously need a separate programming language to script this, but the concurrency would help reduce your overall load time, though the downside would be multiple queries against the database.
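For instance, each worker could run a range-restricted slice of the lookup (purely illustrative; assumes query(ip) and geoloc(start_ip, end_ip, country, region, city) tables as in the other answers):

 -- one worker handles only the 75.x.x.x lookups; other workers take other slices
 SELECT q.ip, g.country, g.region, g.city
     FROM query q
     JOIN geoloc g ON q.ip >= g.start_ip AND q.ip <= g.end_ip
     WHERE q.ip >= inet '75.0.0.0' AND q.ip <= inet '75.255.255.255';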

Upvotes: 1
