jianguo xue
jianguo xue

Reputation: 23

In hive window, what would happen if the value of CURRENT ROW is smaller than that of UNBOUNDED PRECEDING

when i use RANGE to specify the window in hive, I get some confused results.

there is a test table.

select
  id,
  val,
  sum(val) over(order by val rows between unbounded preceding and 
   current row) rows_sum,
  sum(val) over(order by val range between unbounded preceding and 
   current row) range_sum
from test

here is the result of above query. and this is also my expecting results

id val rows_sum range_sum
1 1 1 2
2 1 2 2
3 3 5 5
4 6 11 23
5 6 17 23
6 6 23 23

but for range_sum field, if i change the order rule from asc to desc. say

      sum(val) over(order by val desc range between unbounded preceding and 
   current row) range_sum

here is the result

val range_sum
6 18
6 18
6 18
3 21
1 23
1 23

but my expecting results for range_sum is

val range_sum
6 18
6 18
6 18
3 NULL
1 NULL
1 NULL

There’re two ways of defining frame in Hive, ROWS AND RANGE.for example ,SUM(val) RANGE BETWEEN 100 PRECEDING AND 200 FOLLOWING selects rows by the distance from the current row’s value. Say current val is 200, and this frame will includes rows whose val values range from 100 to 400

so in my example above:range between unbounded preceding and current row

when the val is 6, the frame include 3 rows with val = 6, so the sum is 18.

but when we consider the 4th row, which val is 3. because the ordering rule is desc, the val of the UNBOUNDED PRECEDING row is 6, and CURRENT ROW IS 3.The frame should include rows whose val is between 6 and 3, and no rows satisfy this condition. but the query result is 21 . i mean it likes between 3 and 6 , not between 6 and 3.

Upvotes: 2

Views: 556

Answers (1)

leftjoin
leftjoin

Reputation: 38325

It works as designed and according to the standard, the same behavior is in other databases.

It is easier to find specification for Hive and other databases like Oracle than standard document (for free). For example see "Windowing Specifications in HQL" and Oracle "Window Function Frame Specification"

First the partition is ordered, then bounds are calculated and frame between bounds is used. Frame is taken according to the ORDER BY, not always >=bound1 and <=bound2.

For order by DESC Bound1>=Row.value>=Bound2. Frame includes rows from the partition start through the current row, including all peers of the current row (rows equal to the current row according to the ORDER BY clause).

For order by ASC Bound1<=Row.value<=Bound2.

UNBOUNDED PRECEDING:

The bound (bound1) is the first partition row (according to the order).

CURRENT ROW:

For ROWS, the bound (bound2) is the current row. For RANGE, the bound is the peers of the current row (rows with the same value as current row).

Also read this excellent explanation from Sybase:

The sort order of the ORDER BY values is a critical part of the test for qualifying rows in a value-based frame; the numeric values alone do not determine exclusion or inclusion

Upvotes: 2

Related Questions