yanachen
yanachen

Reputation: 3753

Hive time window function's bug

I have a table named gmv_active_mem_monthly. The whole rows can been seen here:

month   gmv_monthly active_member_monthly
201612  231657626042    2602064
201611  373576915733    3498039
201610  367824193757    3648708
201609  356167649082    3686007
201608  383362147243    3998595
201607  383828659139    3917252
201606  332929299345    3627298
201605  323084120955    3579938
201604  280834688208    3293682
201603  282180201106    3316420
201602  246386923468    3097107
201601  261355415707    3186347
201512  273860930491    3071105
201511  246606316046    2981534
201510  237766306308    2873558
201509  160390583711    2267418
201508  124370765573    2002018
201507  110236706032    1855539
201506  84844225170 1467889
201505  60651906632 1180800
201504  46808796126 917681
201503  12498656329 427529
201502  4918371362  190932
201501  2824293727  129203

I run a simple code in hive:

select  month,
        sum(gmv_monthly) over
        (
            order by  "month"
            rows      between 12 preceding and 1 preceding
        ) as total_gmv,
        sum(active_member_monthly) over
        (
            order by  "month"
            rows      between 12 preceding and 1 preceding
        ) as total_active_mem

from    novaya.gmv_active_mem_monthly 
;

But the result is total wrong while I use the same code on another dataset it is right. The result on the dataset above is :

month   total_gmv   total_active_mem
201501  NULL    NULL
201502  2824293727  129203
201503  7742665089  320135
201504  20241321418 747664
201505  67050117544 1665345
201506  127702024176    2846145
201507  212546249346    4314034
201508  322782955378    6169573
201509  447153720951    8171591
201510  607544304662    10439009
201511  845310610970    13312567
201512  1091916927016   16294101
201601  1365777857507   19365206
201602  1624308979487   22422350
201603  1865777531593   25328525
201604  2135459076370   28217416
201605  2369484968452   30593417
201606  2631917182775   32992555
201607  2880002256950   35151964
201608  3153594210057   37213677
201609  3412585591727   39210254
201610  3608362657098   40628843
201611  3738420544547   41403993
201612  3865391144234   41920498

We can check that 1624308979487 from 201602 minus 1365777857507 from 201601 is not equal to 201601's value in gmv_active_mem_monthly. So what's wrong with the code? The code runs perfect on another dataset without error like this.

Upvotes: 1

Views: 190

Answers (1)

David דודו Markovitz
David דודו Markovitz

Reputation: 44991

There is no problem. The results are correct.
The difference is not not with 1 month, it is with 2 months, one of each edge of the range.

201502  4,918,371,362          <-- This value goes only with 201601  
201503  12,498,656,329         
201504  46,808,796,126 
201505  60,651,906,632 
201506  84,844,225,170 
201507  110,236,706,032 
201508  124,370,765,573 
201509  160,390,583,711 
201510  237,766,306,308 
201511  246,606,316,046 
201512  273,860,930,491 
201601  261,355,415,707        <-- This value goes only with 201602
201602  

Upvotes: 1

Related Questions