Reputation: 3242
My problem.
I have 500,000 distinct IP address I need to geocode. The Geocode look up table have an ip-from and ip-to range that I have to compare against, a table of 1.8 million rows.
So it's basically:
select *
/*+ MAPJOIN(a) */
from ip_address a
cross join ip_lookup b
where a.AddressInt >= b.ip_from and a.AddressInt <= b.ip_to;
On aws EMR, I'm running a cluster of 10 m1.large and during the cross join phase it gets stuck at 0% for 20 min but here's the funny thing:
Stage-5: number of mappers: 1; number of reducers: 0
Questions: 1) any one have any better ideas than a cross join? I don't mind firing up a few (dozen) more instances but I doubt that will help and 2) am I REALLY doing a cross map join as in storing the ip_addresses in the memory?
Thanks in advance.
Upvotes: 2
Views: 1274
Reputation: 11524
I had your (kind of) problem last year.
Since my geocode table fitted in RAM here's what I did:
GeoCoder
) that read geocode info from the disc into RAM and
did geocoding in-memory.geocode.info
to the distributed cache (Hive add file
command does this).UDF
that created (or used if it was already created) GeoCoder
instance in the evaluate
method. Hive UDF
can get the local path of the file in the distributed cache via getClass().getClassLoader().getResource("geocode.info").getFile()
geocode.info
(now it's an ordinary file) and the rest is a history.Probably this method is an overkill (150 lines of Java) but it worked for me.
Also I assume that you really need to use Hadoop (like I did) for your task. Geocoding of 500000 IPs could probably be done on laptop pretty fast.
Upvotes: 1