Bee
Bee

Reputation: 12512

In hive, how to do a calculation among 2 rows?

I have this table.

   +------------------------------------------------------------+
   |     ks      |      time     |     val1      |    val2      | 
   +-------------+---------------+---------------+--------------+
   |     A       |       1       |       1       |      1       |
   |     B       |       1       |       3       |      5       |
   |     A       |       2       |       6       |      7       |
   |     B       |       2       |      10       |     12       |
   |     A       |       4       |       6       |      7       |
   |     B       |       4       |      20       |     26       |
   +------------------------------------------------------------+

What I want to get is for each row,

ks |  time |  val1 | val1 of next ts of same ks  |

To be clear, result of above example should be,

   +------------------------------------------------------------+
   |     ks      |      time     |     val1      |   next.val1  | 
   +-------------+---------------+---------------+--------------+
   |     A       |       1       |       1       |       6      |
   |     B       |       1       |       3       |       10     |
   |     A       |       2       |       6       |       6      |
   |     B       |       2       |      10       |       20     |
   |     A       |       4       |       6       |      null    |
   |     B       |       4       |      20       |      null    |
   +------------------------------------------------------------+

(I need the same next for value2 as well)

I tried a lot to come up with a hive query for this, but still no luck. I was able to write a query for this in sql as mentioned here (Quassnoi's answer), but couldn't create the equivalent in hive because hive doesn't support subqueries in select.

Can someone please help me achieve this?

Thanks in advance.

EDIT:

Query I tried was,

SELECT ks, time, val1, next[0] as next.val1 from
(SELECT ks, time, val1
       COALESCE(
       (
       SELECT Val1, time
       FROM myTable mi
       WHERE mi.val1 > m.val1 AND mi.ks = m.ks
       ORDER BY time
       LIMIT 1
       ), CAST(0 AS BIGINT)) AS next
FROM  myTable m
ORDER BY time) t2;

Upvotes: 2

Views: 4677

Answers (2)

libjack
libjack

Reputation: 6443

I find that using Hive custom map/reduce functionality works great to solve queries similar to this. It gives you the opportunity to consider a set of input and "reduce" to one (or more) results.

This answer discusses the solution.

The key is that you use CLUSTER BY to send all results with similar key value to the same reducer, hence same reduce script, collect accordingly, and then output the reduced results when the key changes, and start collecting for the new key.

Upvotes: 2

Lukas Vermeer
Lukas Vermeer

Reputation: 5940

Your query seems quite similar to the "year ago" reporting that is ubiquitous in financial reporting. I think a LEFT OUTER JOIN is what you are looking for.

We join table myTable to itself, naming the two instances of the same table m and n. For every entry in the first table m we will attempt to find a matching record in n with the same ks value but an incremented value of time. If this record does not exist, all column values for n will be NULL.

SELECT 
    m.ks, 
    m.time,
    m.val1, 
    n.val1 as next_val1,
    m.val2, 
    n.val2 as next_val2
FROM 
    myTable m
LEFT OUTER JOIN
    myTable n
ON (
    m.ks = n.ks
AND 
    m.time + 1 = n.time
);

Returns the following.

ks  time  val1  next_val1  val2  next_val2
A   1     1     6          1     7
A   2     6     6          7     7
A   3     6     NULL       7     NULL
B   1     3     10         5     12
B   2     10    20         12    26
B   3     20    NULL       26    NULL

Hope that helps.

Upvotes: 2

Related Questions