beefyhalo
beefyhalo

Reputation: 1821

HiveQL - Efficient geoip discovery for big data

I am using this hive query to discover all accounts' country codes. The problem is that the join is exploding the data set to unmanageable sizes, and even the job fails to complete sometimes.

What is a fast and efficient way to accomplish this geoip lookup query? Use of Hive is not mandatory

SELECT /*+ MAPJOIN(geoip) */
  data.account_id, geoip.country_code
FROM data JOIN geoip
WHERE data.ip_num BETWEEN geoip.begin_ip_num AND geoip.end_ip_num

Hive does not support BETWEEN inside ON clauses. Also, the WHERE filter is only applied after the join is complete. Any tips would be useful too. I am currently trying to solve this using a custom MapReduce job. Anyone have some ideas?

Info

Upvotes: 2

Views: 677

Answers (4)

beefyhalo
beefyhalo

Reputation: 1821

The chosen solution in the end was to bypass Hive. We serialize the geoip table as a regular java TreeMap, and upload it to custom mappers using Hadoop's DistributedCache.

Upvotes: 2

gencorruptor
gencorruptor

Reputation: 11

Try enabling predicate push down.

Open the file:

 $HIVE_CONF_DIR/hive-site.xml

Add:

<property>
  <name>hive.optimize.ppd to true</name>
  <value>false</value>
  <description>Whether to enable predicate pushdown</description>
</property>

Reference: PredicatePushDown.html

Upvotes: 1

www
www

Reputation: 4391

I guess that are no simple solution for now. No non-equi joins support s#c#s. Here is some solution https://reviews.facebook.net/D4137 . BTW good luck.

Upvotes: 0

TheCoder
TheCoder

Reputation: 1

SELECT d.account_id,d.a,d.b,d.c,i.country_code FROM data d OUTER APPLY (SELECT TOP 1 FROM geoip g WHERE d.ip_num BETWEEN g.begin_ip_num AND g.end_ip_num ) i

salute

Upvotes: -2

Related Questions