Sleeper Smith

Reputation: 3242

hive ip geocoding (cross join semi-big tables)

My problem.

I have 500,000 distinct IP addresses I need to geocode. The geocode lookup table has an ip-from and ip-to range that I have to compare against; that table has 1.8 million rows.

So it's basically:

select /*+ MAPJOIN(a) */ *
from ip_address a
cross join ip_lookup b
where a.AddressInt >= b.ip_from
  and a.AddressInt <= b.ip_to;

On AWS EMR, I'm running a cluster of 10 m1.large instances, and during the cross-join phase it gets stuck at 0% for 20 minutes. But here's the funny thing:

Stage-5: number of mappers: 1; number of reducers: 0

Questions: 1) Does anyone have a better idea than a cross join? I don't mind firing up a few (dozen) more instances, but I doubt that will help. 2) Am I REALLY doing a map join, i.e., holding the IP addresses in memory?

Thanks in advance.

Upvotes: 2

Views: 1274

Answers (1)

Nigel Tufnel

Reputation: 11524

I ran into (kind of) the same problem last year.

Since my geocode table fit in RAM, here's what I did:

  1. I wrote a Java class (let's call it GeoCoder) that reads the geocode info from disk into RAM and does the geocoding in memory.
  2. I added the file geocode.info to the distributed cache (Hive's add file command does this).
  3. I wrote a UDF that creates (or reuses, if already created) a GeoCoder instance in its evaluate method. A Hive UDF can get the local path of a file in the distributed cache via getClass().getClassLoader().getResource("geocode.info").getFile().
  4. Now I have the local path of geocode.info (at this point it's an ordinary file), and the rest is history.
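The in-memory lookup in step 1 might look roughly like this. This is a minimal sketch, not the actual GeoCoder class: the names, the parallel-array layout, and the binary search over ranges sorted by ip-from are all my own assumptions; the real class would also parse geocode.info and be wrapped in a Hive UDF.

```java
import java.util.Arrays;

// Hypothetical sketch of an in-memory IP-range lookup.
// Ranges are stored as parallel arrays sorted ascending by ipFrom;
// lookup does a binary search instead of scanning 1.8M rows per IP.
public class GeoCoder {
    private final long[] ipFrom;    // range starts, sorted ascending
    private final long[] ipTo;      // range ends, parallel to ipFrom
    private final String[] location;

    public GeoCoder(long[] ipFrom, long[] ipTo, String[] location) {
        this.ipFrom = ipFrom;
        this.ipTo = ipTo;
        this.location = location;
    }

    // Returns the location whose [ipFrom, ipTo] range contains addr, or null.
    public String lookup(long addr) {
        int i = Arrays.binarySearch(ipFrom, addr);
        if (i < 0) i = -i - 2;  // index of the last range starting <= addr
        if (i >= 0 && addr <= ipTo[i]) return location[i];
        return null;
    }

    public static void main(String[] args) {
        GeoCoder gc = new GeoCoder(
            new long[]{0L, 16777216L, 3232235520L},
            new long[]{16777215L, 16777471L, 3232301055L},
            new String[]{"reserved", "AU", "private"});
        // 3232235777 is 192.168.1.1, which falls in the 192.168.0.0/16 range
        System.out.println(gc.lookup(3232235777L)); // prints "private"
    }
}
```

With the ranges loaded once per JVM, each evaluate call is O(log n), which is why this beats the cross join even on a single mapper.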

This method is probably overkill (about 150 lines of Java), but it worked for me.

Also, I'm assuming you really do need Hadoop for this task (as I did). Geocoding 500,000 IPs could probably be done pretty fast on a laptop.

Upvotes: 1
