Rodion Gorkovenko
Rodion Gorkovenko

Reputation: 2852

In Hive SQL - joining with intervals without UDF

I've come across exercise which asks to match event-related IPs from one table with countries IP ranges from the other table. I.e. it may look like this (simplified):

table: events

event_id  |  source_ip
----------------------
12345678  |  3.15.49.5
31234314  |  7.1.8.190

table: geoips

country  |  start_ip  |  end_ip
-----------------------------------
us       |  1.0.0.0   |  1.127.255.255
us       |  1.128.0.0 |  1.255.255.255
us       |  3.0.0.0   |  3.255.255.255

and we want to get:

event_id  |  source_ip  |  country
----------------------------------
12345678  |  3.15.49.5  |  us
31234314  |  7.1.8.190  |  uk

Suppose, we can convert IPs to integers to simplify comparison (or convert to zero-padded strings so they could be compared alphabetically).

So is like a join on event_ip >= start_ip and event_ip <= end_ip. However as I understand it is not going to work that straightforward in Hive as "only equality joins are supported".

Most often suggestion (and also in this exercise) is to use UDF - as I understand it is only possible if the range-containing table fits in memory.

Though I do know how to write UDF, I'm not satisfied with this approach. Especially as it don't say what to do if ranges table is very large (not this case, of course) and don't fit in memory easily.

Intuitively it seems, that, aside from Hive, if we have both tables sorted by IP, we can solve the problem in one pass, maintaining the "current range" and matching all upcoming IPs to it, then updating to next range. This even should be easy enough to parallelize...

So I wonder, if (perhaps, in later versions of Hive) there is a solution relying on HQL itself.

Upvotes: 1

Views: 195

Answers (1)

leftjoin
leftjoin

Reputation: 38325

Not-equality joins are not allowed in Hive. And join without On condition is a cross join, it will multiply rows and Where clause applied to the resulting dataset will filter big number of rows, say x195 (the number of countries) and this will cause performance problems.

However in this case you can try to convert CROSS join to map-join, because country table fits in memory. The solution is a MAP-JOIN. Small table will be broadcasted to each mapper, loaded in memory and used for join. EXPLAIN should reflect MAP-JOIN in the plan, check it. Mapjoin will work fast.

Use Tez, vectorized execution, mapjoin and cbo:

set hive.cbo.enable=true;
set hive.execution.engine=tez;
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.optimize.ppd=true;
--enable mapjoin
set hive.auto.convert.join=true;

select e.event_id, e.source_ip, i.country
    from events e 
         cross join geoips i 
   where e.event_ip >= i.start_ip and e.event_ip <= i.end_ip;

Upvotes: 1

Related Questions