In Hive SQL - joining with intervals without UDF

Question

I've come across an exercise that asks to match event-related IPs from one table against country IP ranges from another table. Simplified, it looks like this:

table: events

event_id  |  source_ip
----------------------
12345678  |  3.15.49.5
31234314  |  7.1.8.190

table: geoips

country  |  start_ip  |  end_ip
-----------------------------------
us       |  1.0.0.0   |  1.127.255.255
us       |  1.128.0.0 |  1.255.255.255
us       |  3.0.0.0   |  3.255.255.255

and we want to get:

event_id  |  source_ip  |  country
----------------------------------
12345678  |  3.15.49.5  |  us
31234314  |  7.1.8.190  |  uk

Suppose we can convert the IPs to integers to simplify comparison (or to zero-padded strings, so that they can be compared lexicographically).
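To make the two conversions concrete, here is a minimal sketch in Python (outside Hive, purely for illustration). The helper names `ip_to_int` and `ip_to_padded` are my own and not part of the exercise; the sketch assumes IPv4 addresses in dotted-quad form.

```python
import ipaddress


def ip_to_int(ip: str) -> int:
    """Convert a dotted-quad IPv4 string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(ip))


def ip_to_padded(ip: str) -> str:
    """Zero-pad every octet to 3 digits so that string order matches numeric order."""
    return ".".join(f"{int(octet):03d}" for octet in ip.split("."))


# Raw strings compare incorrectly: "10.0.0.0" sorts before "3.0.0.0".
print("10.0.0.0" < "3.0.0.0")
# After either conversion the comparison follows the numeric order.
print(ip_to_int("10.0.0.0") > ip_to_int("3.0.0.0"))
print(ip_to_padded("10.0.0.0") > ip_to_padded("3.0.0.0"))
```

Either representation turns the range check into an ordinary comparison of two scalar values.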

So this is essentially a join on source_ip >= start_ip and source_ip <= end_ip. However, as I understand it, this will not work so straightforwardly in Hive, because "only equality joins are supported".

The most common suggestion (including in this exercise) is to use a UDF. As I understand it, that is only possible if the table containing the ranges fits in memory.

Although I do know how to write a UDF, I'm not satisfied with this approach, especially as it doesn't say what to do if the ranges table is very large (not the case here, of course) and doesn't fit in memory easily.

Intuitively, setting Hive aside, it seems that if both tables are sorted by IP, we can solve the problem in one pass: maintain a "current range", match all upcoming IPs against it, then move on to the next range. This should even be easy enough to parallelize...
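The one-pass idea can be sketched as follows, again in plain Python rather than HQL. This is only an illustration under my own assumptions: the ranges do not overlap, both inputs fit in memory for sorting, and the function name `match_sorted` is hypothetical. An IP that falls in no listed range gets `None`.

```python
import ipaddress


def ip_to_int(ip):
    """Convert a dotted-quad IPv4 string to an integer."""
    return int(ipaddress.IPv4Address(ip))


def match_sorted(events, ranges):
    """Match events to ranges in a single sweep over both sorted inputs.

    events: iterable of (event_id, source_ip)
    ranges: iterable of (country, start_ip, end_ip), assumed non-overlapping
    """
    events = sorted((ip_to_int(ip), eid, ip) for eid, ip in events)
    ranges = sorted((ip_to_int(s), ip_to_int(e), c) for c, s, e in ranges)

    result = []
    i = 0  # index of the "current range"
    for ip_num, eid, ip in events:
        # Skip ranges that end before this IP; they cannot match
        # any later (larger) IP either.
        while i < len(ranges) and ranges[i][1] < ip_num:
            i += 1
        if i < len(ranges) and ranges[i][0] <= ip_num:
            country = ranges[i][2]
        else:
            country = None  # IP falls in a gap or past the last range
        result.append((eid, ip, country))
    return result


events = [(12345678, "3.15.49.5"), (31234314, "7.1.8.190")]
geoips = [
    ("us", "1.0.0.0", "1.127.255.255"),
    ("us", "1.128.0.0", "1.255.255.255"),
    ("us", "3.0.0.0", "3.255.255.255"),
]
print(match_sorted(events, geoips))
```

With the simplified sample tables above, the second event comes back without a country, because the sample geoips rows do not include a range covering 7.1.8.190. The range index only ever moves forward, so after sorting the sweep is linear in the combined size of both inputs.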

So I wonder whether (perhaps in later versions of Hive) there is a solution that relies on HQL alone.
