Reputation: 1821
I am using this Hive query to discover the country code of every account. The problem is that the join explodes the data set to unmanageable sizes, and sometimes the job fails to complete altogether.
What is a fast and efficient way to accomplish this geoip lookup? Use of Hive is not mandatory.
SELECT /*+ MAPJOIN(geoip) */
  data.account_id, geoip.country_code
FROM data JOIN geoip
WHERE data.ip_num BETWEEN geoip.begin_ip_num AND geoip.end_ip_num;
Hive does not support BETWEEN inside ON clauses. Also, the WHERE filter is only applied after the join is complete. Any tips would be useful too. I am currently trying to solve this using a custom MapReduce job. Anyone have some ideas?
The geoip table is around 1MB in size.
One approach I tried was to create a bucket from the first octet of begin_ip_num (ex: the 123 of 123.0.0.0), do the same for data.ip_num, and then use that condition in the JOIN clause, like FROM data JOIN geoip ON (data.first_octet_bucket = geoip.first_octet_bucket). This technique does not work, since some rows will be left out: a geoip range can span more than one first octet (ex: a range from 122.250.0.0 to 123.5.0.0 would need to appear in both the 122 and 123 buckets).
Upvotes: 2
Views: 677
Reputation: 1821
The chosen solution in the end was to bypass Hive. We serialize the geoip table as a regular Java TreeMap and ship it to the custom mappers using Hadoop's DistributedCache.
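A minimal sketch of the mapper side, assuming the TreeMap is keyed by begin_ip_num and maps each key to the range's end and country code (the class name, cache file name, and input layout below are illustrative assumptions, not our exact code):

// Looks up each account's country code against a geoip TreeMap that was
// serialized ahead of time and shipped via the DistributedCache.
import java.io.FileInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class GeoipMapper extends Mapper<LongWritable, Text, Text, Text> {

    // begin_ip_num -> "end_ip_num\tcountry_code"
    private TreeMap<Long, String> ranges = new TreeMap<Long, String>();

    @Override
    @SuppressWarnings("unchecked")
    protected void setup(Context context) throws IOException {
        // The serialized table was added to the DistributedCache with a
        // symlink named "geoip.ser" (illustrative name).
        ObjectInputStream in =
                new ObjectInputStream(new FileInputStream("geoip.ser"));
        try {
            ranges = (TreeMap<Long, String>) in.readObject();
        } catch (ClassNotFoundException e) {
            throw new IOException(e);
        } finally {
            in.close();
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumed input layout: account_id \t ip_num
        String[] fields = line.toString().split("\t");
        String accountId = fields[0];
        long ipNum = Long.parseLong(fields[1]);

        // floorEntry returns the range with the largest begin_ip_num <= ipNum;
        // the range matches only if ipNum is also <= its end_ip_num.
        Map.Entry<Long, String> candidate = ranges.floorEntry(ipNum);
        if (candidate != null) {
            String[] range = candidate.getValue().split("\t");
            if (ipNum <= Long.parseLong(range[0])) {
                context.write(new Text(accountId), new Text(range[1]));
            }
        }
    }
}

Since the table is only about 1MB it fits comfortably in each mapper's memory, and TreeMap.floorEntry makes every lookup O(log n) instead of a scan over all ranges.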
Upvotes: 2
Reputation: 11
Try enabling predicate push down.
Open the file:
$HIVE_CONF_DIR/hive-site.xml
Add:
<property>
  <name>hive.optimize.ppd</name>
  <value>true</value>
  <description>Whether to enable predicate pushdown</description>
</property>
Reference: PredicatePushDown.html
Upvotes: 1
Reputation: 4391
I guess there is no simple solution for now, since Hive does not support non-equi joins. Here is a proposed solution: https://reviews.facebook.net/D4137 . BTW, good luck.
Upvotes: 0
Reputation: 1
SELECT d.account_id, d.a, d.b, d.c, i.country_code
FROM data d
OUTER APPLY (
  SELECT TOP 1 g.country_code
  FROM geoip g
  WHERE d.ip_num BETWEEN g.begin_ip_num AND g.end_ip_num
) i
Regards.
Upvotes: -2