Hive IP geocoding (cross join of semi-big tables)

Question

My problem.

I have 500,000 distinct IP addresses that I need to geocode. The geocode lookup table has 1.8 million rows, each with an ip_from and ip_to range that I have to compare against.

So it's basically:

select /*+ MAPJOIN(a) */ *
from ip_address a
cross join ip_lookup b
where a.AddressInt >= b.ip_from and a.AddressInt <= b.ip_to;
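
To make the cost of that predicate concrete, here is a minimal sketch in Python rather than HiveQL. It mirrors the query on toy data, then shows for contrast a binary search over ranges sorted by ip_from. The toy rows, the label column, and both function names are made up for illustration, and the binary search assumes the lookup ranges do not overlap, which is not stated in the question.

```python
from bisect import bisect_right

# Toy stand-ins for the two tables; every value here is invented for illustration.
# ip_lookup rows are (ip_from, ip_to, label), sorted by ip_from and
# ASSUMED not to overlap.
ip_lookup = [
    (0, 99, "A"),
    (100, 199, "B"),
    (300, 399, "C"),
]
ip_address = [5, 150, 250, 399]  # AddressInt values


def cross_join_match(addresses, ranges):
    """Mirror the query: test every address against every range."""
    return [
        (addr, label)
        for addr in addresses
        for ip_from, ip_to, label in ranges
        if ip_from <= addr <= ip_to
    ]


def sorted_range_match(addresses, ranges):
    """Binary-search the sorted ip_from values, then check ip_to."""
    starts = [r[0] for r in ranges]
    out = []
    for addr in addresses:
        i = bisect_right(starts, addr) - 1  # last range starting at or below addr
        if i >= 0 and addr <= ranges[i][1]:
            out.append((addr, ranges[i][2]))
    return out


# Row pairs the cross join has to evaluate at the sizes quoted in the question.
comparisons = 500_000 * 1_800_000
```

At the sizes above, the cross join evaluates 500,000 × 1,800,000 = 9 × 10^11 row pairs, while a binary search over 1.8 million sorted ranges needs about 21 probes per address.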

On AWS EMR, I'm running a cluster of 10 m1.large instances. During the cross join phase the job sits at 0% for 20 minutes, but here's the funny thing:

Stage-5: number of mappers: 1; number of reducers: 0

Questions: 1) Does anyone have a better idea than a cross join? I don't mind firing up a few (dozen) more instances, but I doubt that will help. 2) Am I REALLY doing a cross map join, as in storing the IP addresses in memory?

Thanks in advance.
