doublebyte
doublebyte

Reputation: 1275

Non Equality Join on Hive

I am trying to emulate a SQL query on hive, where I grab data from two tables that do not have a common field.

The two tables are geometric (GeoHive) and one contains a bunch of points (one for each record) and the other one a grid (one cell for each record). The objective is to count how many points fit inside each cell.

enter image description here

The join condition between the two tables is the geometric function itself. This is how I do it in PostGIS:

SELECT g.geom, count(t.geom) AS cnt
   FROM grid g, points t
WHERE st_contains(g.geom,t.geom) GROUP BY g.geom

Hive does not accept the multiple select, so I have to perform a join. I thought something like this could work:

SELECT count(1) FROM grid JOIN points WHERE   
 ST_Contains(grid.geom,points.geom) GROUP BY grid.geom;

, but hive silently ignores my condition:

"Hive does not support join conditions that are not equality conditions as it is very difficult to express such conditions as a map/reduce job"

Anyone has any ideas of how to refactor such query on Hive?

Upvotes: 1

Views: 1252

Answers (1)

doublebyte
doublebyte

Reputation: 1275

I'll answer my own question, to state that my query attempt was correct: it was failing for other reasons, that have nothing to do with syntax.

SELECT count(grid.geom) as ptcnt FROM grid JOIN points WHERE
ST_Contains(grid.geom,points.geom) GROUP BY grid.geom;

This is actually equivalent to writing:

SELECT count(grid.geom) as ptcnt FROM grid JOIN points ON(TRUE) WHERE
ST_Contains(grid.geom,points.geom) GROUP BY grid.geom;

Hive does not in fact support non-equality joins, so I set the join condition to "TRUE", and filter my results on the "WHERE" clause.

Apparently this is a pretty standard behaviour for spatial joins, as you can read here.

Upvotes: 2

Related Questions