Jim Pivarski
Jim Pivarski

Reputation: 5974

How do I reuse expressions in a Hive (HQL) query?

I'm a newcomer to both Hive and SQL, so this may be an easy question, but I haven't found the solution with a few hours of web searches.

I'm writing expressions in my select query that I want to use in other columns of the select query. For instance,

select (-b + sqrt(b*b - 4*a*c))/(2*a), (-b - sqrt(b*b - 4*a*c))/(2*a) from tablename;

Apart from the slight inefficiency of recalculating the discriminant (sqrt(b*b - 4*a*c)), this sort of thing is hard for humans to read. If this were a program, I'd store the partial result disc = sqrt(b*b - 4*a*c) like this:

select [somehow assign disc] (-b + disc)/(2*a), (-b - disc)/(2*a) from tablename;

What I've been able to find on "temporary variables" and "table variables" looks like a separate query that gets saved to an intermediate table and folded into the main query, which is way to much overhead for something that should have zero overhead. If Hive wanted to, it could expand all of my expressions out before submitting the query (though I'd prefer it to actually avoid recalculation).

What's the name of the thing that I'm looking for? How can I rewrite the above example to only show sqrt(b*b - 4*a*c) once and only submit one map-reduce job (no intermediate, temporary tables)?

(My actual case is more complicated and more computationally intensive.)

Upvotes: 1

Views: 3147

Answers (1)

jbaptiste
jbaptiste

Reputation: 371

If the point is not to rewrite the whole code each time you use your disc, you can create a Hive UDF (user defined function). I'm not sure whether the functions results are cached / if ther is any optimisation on multiple function calls with the same arguments.

If the point is to compute the result of "disc" only once by line, you may use a common table expression

with q1 as ( select sqrt(b*b - 4*a*c) as disc, a,b,c from tablename)

select (-b + disc)/(2*a), (-b - disc)/(2*a) from q1;

You may also use a subquery

select (-b + disc)/(2*a), (-b - disc)/(2*a) from

( select sqrt(b*b - 4*a*c) as disc, a,b,c from tablename) q1


UPDATE :

I compared the performance of the 3 queries (select with recalculation, CTE and subquery), and execution times are about the same (18,91, 18,78, 18,94).

Hive execution engine seems to be clever !

So, the only consideration you may care about, is the readability of the code. (Unless your tests, on a more complex process show otherwise, and I'd be glad to hear about)

Upvotes: 1

Related Questions