Reputation: 2977
I have 2 million IP addresses and 25 million IP ranges, each range stored with a Start IP, an End IP and a geo-location, in PostgreSQL. Is there an efficient way to look up the geo-locations of those 2 million IPs against the 25 million ranges? What I did was to check whether an IP address falls between the Start IP and End IP and then look up the corresponding location, but this seems to take forever. Essentially the problem is looking up a bunch of integers in a set of ranges, like searching for {7, 13, 31, 42} in:
Start  End  Loc
    1   10  US
   11   20  US
   21   26  CN
   29   32  SE
   33   45  CA
and returning:
 7  US
13  US
31  SE
42  CA
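To make the toy example concrete, what I'm running is essentially a brute-force range join like the sketch below (ranges(start_int, end_int, loc) and ips(ip_int) are made-up names for the tables above):

-- Toy version of my current lookup: match each value to the range that contains it.
-- Table and column names are placeholders for this example.
SELECT i.ip_int, r.loc
FROM ips i
JOIN ranges r
  ON i.ip_int BETWEEN r.start_int AND r.end_int;

Against 2 million IPs and 25 million ranges this turns into an enormous nested-loop join, which is presumably why it takes forever.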
Please note that the ranges are not necessarily contiguous and are not all the same size. Thank you!
EDIT
As a concrete example, here's the data I'm dealing with:
start_ip | end_ip | country | region | city |
------------------+------------------+---------+-----------+-----------+-
1.33.254.73/32 | 1.33.254.73/32 | jpn | 33 | kurashiki |
1.39.1.0/32 | 1.39.4.255/32 | ind | mh | mumbai |
1.40.144.0/32 | 1.40.145.255/32 | aus | ns | fairfield |
1.40.235.0/32 | 1.40.242.255/32 | aus | ns | sydney |
1.44.28.0/32 | 1.44.29.255/32 | aus | vi | melbourne |
1.44.82.0/32 | 1.44.83.255/32 | aus | vi | melbourne |
1.44.92.0/32 | 1.44.93.255/32 | aus | vi | melbourne |
1.44.128.0/32 | 1.44.129.255/32 | aus | vi | melbourne |
1.44.220.0/32 | 1.44.221.255/32 | aus | vi | melbourne |
......
......
And the queries are something like:
75.149.219.61/32
68.239.61.29/32
96.41.50.165/32
183.62.126.7/32
......
Upvotes: 2
Views: 1176
Reputation: 3629
You'll have to provide your query and query plan for meaningful input on this. E.g.:
explain select hits.ip, locations.loc
from hits left outer join locations
on (hits.ip >= locations.start and hits.ip <= locations.stop);
QUERY PLAN
-------------------------------------------------------------------------------
Nested Loop Left Join (cost=0.00..245.06 rows=2400 width=36)
Join Filter: ((hits.ip >= locations.start) AND (hits.ip <= locations.stop))
-> Seq Scan on hits (cost=0.00..34.00 rows=2400 width=4)
-> Materialize (cost=0.00..1.07 rows=5 width=40)
-> Seq Scan on locations (cost=0.00..1.05 rows=5 width=40)
(5 rows)
I'm not sure you want to add the location data to your index as one of the other answers suggests. That's just dead data adding bloat; it's of no use for looking up rows.
Even if you use a PostgreSQL version that supports index-only scans (9.2, which is still in beta), a smaller, leaner index will probably still give faster results, at the cost of one additional tuple lookup per row.
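For instance, a lean two-column index along these lines (reusing the locations table from the plan above; the index name is arbitrary) is all the range lookup itself needs:

-- Index only the range bounds used in the join filter; the location column
-- stays in the heap and costs one extra tuple fetch per matching row.
CREATE INDEX locations_start_stop_idx ON locations (start, stop);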
Upvotes: 1
Reputation: 57408
The best and most elegant solution would be, I think, to store the IPs and ranges using the inet/cidr datatypes. IP ranges are usually published in network/mask format anyway, not as Start/End. This allows you to write a JOIN based
ON (ip.addr << geoloc.range)
Of course, the IP table should be indexed on addr and geoloc on (range, location), and if you do not have the CIDR format and need to build it from Start/End, that could be expensive (though the table would be easier to use afterwards).
See
http://www.postgresql.org/docs/9.0/static/functions-net.html
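A minimal sketch of that join, reusing the ip.addr and geoloc.range names from the ON clause above (the location column is illustrative):

-- Containment join: a << b is true when address a is strictly contained in
-- network b (use <<= if an address may equal a /32 network exactly).
SELECT ip.addr, geoloc.location
FROM ip
JOIN geoloc ON ip.addr << geoloc.range;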
EDIT: unfortunately, those start/end values look like "optimized" CIDR ranges. In other words, e.g.
1.40.235.0 1.40.242.255
is actually the merge of four adjacent CIDR blocks (the binary values below are the third octet):
11101011              235.0 - 235.255   (1.40.235.0/24)
11101100 - 11101111   236.0 - 239.255   (1.40.236.0/22)
11110000 - 11110001   240.0 - 241.255   (1.40.240.0/23)
11110010              242.0 - 242.255   (1.40.242.0/24)
so it's not practical to explode the row into the four rows necessary for CIDR operation.
The Start/End values look like they are already in the cidr datatype, so converting them to inet (they're all /32 anyway...), keeping the queried value in the inet datatype as well, and indexing on (start_ip, end_ip) ought to give reasonable results:
SELECT query.ip, geoloc.country, geoloc.region, geoloc.city
FROM query JOIN geoloc
ON (query.ip >= geoloc.start_ip AND query.ip <= geoloc.end_ip);
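The supporting index would be something along these lines (the index name is arbitrary):

-- B-tree index on the range bounds referenced by the join condition above.
CREATE INDEX geoloc_start_end_idx ON geoloc (start_ip, end_ip);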
Another alternative, NOT very elegant (actually a hack), would be to "explode" both the ip and geoloc tables into separate sub-tables based on, e.g., the first byte of the address and range (I don't expect you to have an IP range that spans more than one first byte).
SELECT *
  INTO TABLE geoloc_5
  FROM geoloc
 WHERE start_ip >= inet '5.0.0.0' AND end_ip <= inet '5.255.255.255';

SELECT *
  INTO TABLE query_5
  FROM query
 WHERE ip >= inet '5.0.0.0' AND ip <= inet '5.255.255.255';

Remember to CREATE INDEX on geoloc_5 (start_ip, end_ip), and likewise for every other bucket (see the sketch below).
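For the "5" bucket, for instance, the index plus the per-bucket lookup would look like this (index name is arbitrary):

-- Per-bucket index and lookup; repeat for every first byte present in the data.
CREATE INDEX geoloc_5_start_end_idx ON geoloc_5 (start_ip, end_ip);

SELECT q.ip, g.country, g.region, g.city
FROM query_5 q
JOIN geoloc_5 g
  ON (q.ip >= g.start_ip AND q.ip <= g.end_ip);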
This approach did work several years ago for a big PostgreSQL batch job, but I expect that, since then, a cleverer index manager, together with the dedicated datatypes, has evolved to more than match this DIY partitioning. So the naive partitioning above should only be used as a very last resort if the << CIDR operator can't be used.
That said, suppose that both tables have a flat distribution (just to get a ballpark figure).
Then, instead of one SELECT over 2M x 25M records, you run 256 SELECTs of 2M/256 by 25M/256 records each. So instead of 1 x 2M x 25M = 50T comparisons, you have 256 x (2M/256) x (25M/256), roughly 195G comparisons, which ought to be around 200x faster than a straight JOIN.
But I repeat, I expect that PostgreSQL, seeing a properly indexed CIDR field, will no longer really perform a "straight" JOIN, but employ this trick (and then some).
Upvotes: 2
Reputation: 3219
If you're querying the Loc column, you should add an index to it. Also, since this is a three-column table, it might be wise to combine StartIP and EndIP, use that as the key, and use the Geolocation as the value, and then read this all from a key-value store such as Redis or Memcached. NoSQL/tableless data stores are designed for this sort of thing, where you are reading against millions of data points.
EDIT: after reading some comments, it occurred to me that another solution would be to parallelize your search through something like MapReduce. Assign threads to query a range of IPs (e.g. Thread1: 1-10, Thread2: 11-20, etc...) in the Map step, and then assign threads to reduce the fragmented queries into one result in the Reduce step. You'd obviously need a separate programming language to script this, but the concurrency would help reduce your overall load time, though the downside would be multiple queries into the database.
Upvotes: 1