Reputation: 23
when i use RANGE to specify the window in hive, I get some confused results.
there is a test table.
select
id,
val,
sum(val) over(order by val rows between unbounded preceding and
current row) rows_sum,
sum(val) over(order by val range between unbounded preceding and
current row) range_sum
from test
here is the result of above query. and this is also my expecting results
id | val | rows_sum | range_sum |
---|---|---|---|
1 | 1 | 1 | 2 |
2 | 1 | 2 | 2 |
3 | 3 | 5 | 5 |
4 | 6 | 11 | 23 |
5 | 6 | 17 | 23 |
6 | 6 | 23 | 23 |
but for range_sum field, if i change the order rule from asc to desc. say
sum(val) over(order by val desc range between unbounded preceding and
current row) range_sum
here is the result
val | range_sum |
---|---|
6 | 18 |
6 | 18 |
6 | 18 |
3 | 21 |
1 | 23 |
1 | 23 |
but my expecting results for range_sum is
val | range_sum |
---|---|
6 | 18 |
6 | 18 |
6 | 18 |
3 | NULL |
1 | NULL |
1 | NULL |
There’re two ways of defining frame in Hive, ROWS AND RANGE.for example ,SUM(val) RANGE BETWEEN 100 PRECEDING AND 200 FOLLOWING selects rows by the distance from the current row’s value. Say current val is 200, and this frame will includes rows whose val values range from 100 to 400
so in my example above:range between unbounded preceding and current row
when the val is 6, the frame include 3 rows with val = 6, so the sum is 18.
but when we consider the 4th row, which val is 3. because the ordering rule is desc, the val of the UNBOUNDED PRECEDING row is 6, and CURRENT ROW IS 3.The frame should include rows whose val is between 6 and 3, and no rows satisfy this condition. but the query result is 21 . i mean it likes between 3 and 6 , not between 6 and 3.
Upvotes: 2
Views: 556
Reputation: 38325
It works as designed and according to the standard, the same behavior is in other databases.
It is easier to find specification for Hive and other databases like Oracle than standard document (for free). For example see "Windowing Specifications in HQL" and Oracle "Window Function Frame Specification"
First the partition is ordered, then bounds are calculated and frame between bounds is used. Frame is taken according to the ORDER BY, not always >=bound1
and <=bound2
.
For order by DESC Bound1>=Row.value>=Bound2
. Frame includes rows from the partition start through the current row, including all peers of the current row (rows equal to the current row according to the ORDER BY clause).
For order by ASC Bound1<=Row.value<=Bound2
.
UNBOUNDED PRECEDING:
The bound (bound1) is the first partition row (according to the order).
CURRENT ROW:
For ROWS, the bound (bound2) is the current row. For RANGE, the bound is the peers of the current row (rows with the same value as current row).
Also read this excellent explanation from Sybase:
The sort order of the ORDER BY values is a critical part of the test for qualifying rows in a value-based frame; the numeric values alone do not determine exclusion or inclusion
Upvotes: 2