Reputation: 1821
I am using this Hive query to discover the country code of every account. The problem is that the join explodes the data set to unmanageable sizes, and sometimes the job fails to complete altogether.
What is a fast and efficient way to accomplish this geoip lookup? Use of Hive is not mandatory.
SELECT /*+ MAPJOIN(geoip) */
  data.account_id, geoip.country_code
FROM data JOIN geoip
WHERE data.ip_num BETWEEN geoip.begin_ip_num AND geoip.end_ip_num;
Hive does not support BETWEEN inside ON clauses. Also, the WHERE filter is only applied after the join is complete. Any tips would be useful too. I am currently trying to solve this using a custom MapReduce job. Anyone have some ideas?
The geoip table is around 1MB in size.
One approach I tried was to create a bucket from the first octet of begin_ip_num (ex: the 123 of 123.0.0.0), do the same for data.ip_num, and then use that condition in the JOIN clause, like FROM data JOIN geoip ON (data.first_octet_bucket = geoip.first_octet_bucket). This technique does not work, since some rows will be left out: a geoip range can span more than one first octet (ex: a range from 122.250.0.0 to 123.5.0.0 would need to appear in both the 122 and 123 buckets).
Upvotes: 2
Views: 677
Reputation: 1821
The chosen solution in the end was to bypass Hive. We serialize the geoip table as a regular Java TreeMap and ship it to the custom mappers using Hadoop's DistributedCache.
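A minimal sketch of the mapper side, assuming the TreeMap is keyed by begin_ip_num and maps each key to the range's end and country code (the class name, cache file name, and input layout below are illustrative assumptions, not our exact code):

// Looks up each account's country code against a geoip TreeMap that was
// serialized ahead of time and shipped via the DistributedCache.
import java.io.FileInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class GeoipMapper extends Mapper<LongWritable, Text, Text, Text> {

    // begin_ip_num -> "end_ip_num\tcountry_code"
    private TreeMap<Long, String> ranges = new TreeMap<Long, String>();

    @Override
    @SuppressWarnings("unchecked")
    protected void setup(Context context) throws IOException {
        // The serialized table was added to the DistributedCache with a
        // symlink named "geoip.ser" (illustrative name).
        ObjectInputStream in =
                new ObjectInputStream(new FileInputStream("geoip.ser"));
        try {
            ranges = (TreeMap<Long, String>) in.readObject();
        } catch (ClassNotFoundException e) {
            throw new IOException(e);
        } finally {
            in.close();
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumed input layout: account_id \t ip_num
        String[] fields = line.toString().split("\t");
        String accountId = fields[0];
        long ipNum = Long.parseLong(fields[1]);

        // floorEntry returns the range with the largest begin_ip_num <= ipNum;
        // the range matches only if ipNum is also <= its end_ip_num.
        Map.Entry<Long, String> candidate = ranges.floorEntry(ipNum);
        if (candidate != null) {
            String[] range = candidate.getValue().split("\t");
            if (ipNum <= Long.parseLong(range[0])) {
                context.write(new Text(accountId), new Text(range[1]));
            }
        }
    }
}

Since the table is only about 1MB it fits comfortably in each mapper's memory, and TreeMap.floorEntry makes every lookup O(log n) instead of a scan over all ranges.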
Upvotes: 2
Reputation: 11
Try enabling predicate push down.
Open the file:
$HIVE_CONF_DIR/hive-site.xml
Add:
<property>
  <name>hive.optimize.ppd</name>
  <value>true</value>
  <description>Whether to enable predicate pushdown</description>
</property>
Reference: PredicatePushDown.html
Upvotes: 1
Reputation: 4391
I guess there is no simple solution for now, since Hive does not support non-equi joins. Here is a proposed solution: https://reviews.facebook.net/D4137 . BTW, good luck.
Upvotes: 0
Reputation: 1
SELECT d.account_id, d.a, d.b, d.c, i.country_code
FROM data d
OUTER APPLY (
  SELECT TOP 1 g.country_code
  FROM geoip g
  WHERE d.ip_num BETWEEN g.begin_ip_num AND g.end_ip_num
) i
Regards.
Upvotes: -2