Reputation: 1561
Question: How do I fix this? Tried casting and all sorts of tricks but nothing seems to fix it while keeping the table the same.
Bonus Question: Why is this happening, and does it happen in other RDBMS? Should I be regretting Snowflake?
My query has a problem where something that should be equal to 0, is showing up as a very, very, small number. I can reproduce the issue on a sample dataset:
Setup to try it yourself:
drop table if exists "TESTBUG";
create table "TESTBUG" (id number, val float);
insert into "TESTBUG" values(1,0.000);
insert into "TESTBUG" values(2,0.000);
insert into "TESTBUG" values(3,0.001);
insert into "TESTBUG" values(4,0.000);
insert into "TESTBUG" values(5,0.000);
insert into "TESTBUG" values(6,0.000);
insert into "TESTBUG" values(7,0.000);
insert into "TESTBUG" values(8,0.000);
insert into "TESTBUG" values(9,0.000);
insert into "TESTBUG" values(10,0.000);
insert into "TESTBUG" values(11,0.000);
insert into "TESTBUG" values(12,0.000);
insert into "TESTBUG" values(13,0.000);
insert into "TESTBUG" values(14,0.000);
insert into "TESTBUG" values(15,0.000);
insert into "TESTBUG" values(16,0.000);
insert into "TESTBUG" values(17,0.000);
insert into "TESTBUG" values(18,0.000);
insert into "TESTBUG" values(19,0.000);
insert into "TESTBUG" values(20,0.000);
What we're dealing with here is 20 fake rows of data:
ID | VAL |
---|---|
1 | 0 |
2 | 0 |
3 | 0.001 |
4 | 0 |
5 | 0 |
6 | 0 |
7 | 0 |
8 | 0 |
9 | 0 |
10 | 0 |
11 | 0 |
12 | 0 |
13 | 0 |
14 | 0 |
15 | 0 |
16 | 0 |
17 | 0 |
18 | 0 |
19 | 0 |
20 | 0 |
Here is the SQL which produces a non-zero result. It looks like the MOVING_AVG
is the culprit? I am not sure.
Also it's strange that when you look at the results, it is only ID=18 that has a non-zero result. Row 19 should effectively be the same since the window is only 14 periods long.
WITH LAG AS (
SELECT *,
LAG(val,1) OVER(ORDER BY id) AS lag_val
FROM "RASGO.PUBLIC.TESTBUG"
),
DIFF AS (
SELECT *,
val - lag_val as diff_lag_val
from LAG
),
MOVING_AVG AS (
SELECT *,
avg(diff_lag_val) OVER(ORDER BY id ROWS BETWEEN 14 PRECEDING AND CURRENT ROW) AS moving_avg_diff
FROM DIFF
)
SELECT * FROM MOVING_AVG WHERE id > 14 AND moving_avg_diff < 0
ID | VAL | LAG_VAL | DIFF_LAG_VAL | MOVING_AVG_DIFF |
---|---|---|---|---|
18 | 0 | 0 | 0 | -6.666666667e-05 |
Upvotes: 0
Views: 467
Reputation: 25903
So packing it into one block:
with testbug(id, val) as (
select * from values
(1, 0.000::float),
(2, 0.000::float),
(3, 0.001::float),
(4, 0.000::float),
(5, 0.000::float),
(6, 0.000::float),
(7, 0.000::float),
(8, 0.000::float),
(9, 0.000::float),
(10, 0.000::float),
(11, 0.000::float),
(12, 0.000::float),
(13, 0.000::float),
(14, 0.000::float),
(15, 0.000::float),
(16, 0.000::float),
(17, 0.000::float),
(18, 0.000::float),
(19, 0.000::float),
(20, 0.000::float)
), diff AS (
SELECT
*,
LAG(val,1) OVER(ORDER BY id) AS lag_val,
val - lag_val as diff_lag_val
from TESTBUG
)
SELECT
*,
avg(diff_lag_val) OVER(ORDER BY id ROWS BETWEEN 14 PRECEDING AND CURRENT ROW) AS moving_avg_diff
FROM DIFF
--QUALIFY id > 14 AND moving_avg_diff < 0
order by id;
gives:
ID | VAL | LAG_VAL | DIFF_LAG_VAL | MOVING_AVG_DIFF |
---|---|---|---|---|
1 | 0 | |||
2 | 0 | 0 | 0 | 0 |
3 | 0.001 | 0 | 0.001 | 0.0005 |
4 | 0 | 0.001 | -0.001 | 0 |
5 | 0 | 0 | 0 | 0 |
6 | 0 | 0 | 0 | 0 |
7 | 0 | 0 | 0 | 0 |
8 | 0 | 0 | 0 | 0 |
9 | 0 | 0 | 0 | 0 |
10 | 0 | 0 | 0 | 0 |
11 | 0 | 0 | 0 | 0 |
12 | 0 | 0 | 0 | 0 |
13 | 0 | 0 | 0 | 0 |
14 | 0 | 0 | 0 | 0 |
15 | 0 | 0 | 0 | 0 |
16 | 0 | 0 | 0 | 0 |
17 | 0 | 0 | 0 | 0 |
18 | 0 | 0 | 0 | -0.000066 |
19 | 0 | 0 | 0 | 0 |
20 | 0 | 0 | 0 | 0 |
so the problem is on line 4 where you get a negative value, and then the moving average for all lines prior to 18 the positive and negative value cancel out..
which can be shown with a smaller toy problem:
with testbug(id, val) as (
select * from values
(1, 0.000::float),
(2, 0.000::float),
(3, 0.001::float),
(4, 0.000::float),
(5, 0.000::float),
(6, 0.000::float)
), diff AS (
SELECT
*,
LAG(val,1) OVER(ORDER BY id) AS lag_val,
val - lag_val as diff_lag_val
from TESTBUG
)
SELECT
*,
avg(diff_lag_val) OVER(ORDER BY id ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg_diff
FROM DIFF
order by id;
ID | VAL | LAG_VAL | DIFF_LAG_VAL | MOVING_AVG_DIFF |
---|---|---|---|---|
1 | 0 | |||
2 | 0 | 0 | 0 | 0 |
3 | 0.001 | 0 | 0.001 | 0.0005 |
4 | 0 | 0.001 | -0.001 | 0 |
5 | 0 | 0 | 0 | 0 |
6 | 0 | 0 | 0 | -0.0003333333333 |
SELECT
*,
sum(diff_lag_val) OVER(ORDER BY id ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as moving_sum_diff,
count(diff_lag_val) OVER(ORDER BY id ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as moving_countdiff,
avg(diff_lag_val) OVER(ORDER BY id ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg_diff,
div0(moving_sum_diff, moving_countdiff) as man_avg
FROM DIFF
order by id;
we now add the manual steps to do the average by hand, and we get:
ID | VAL | LAG_VAL | DIFF_LAG_VAL | MOVING_SUM_DIFF | MOVING_COUNTDIFF | MOVING_AVG_DIFF | MAN_AVG |
---|---|---|---|---|---|---|---|
1 | 0 | 0 | |||||
2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
3 | 0.001 | 0 | 0.001 | 0.001 | 2 | 0.0005 | 0.0005 |
4 | 0 | 0.001 | -0.001 | 0 | 3 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 3 | 0 | 0 |
6 | 0 | 0 | 0 | -0.001 | 3 | -0.0003333333333 | -0.0003333333333 |
so the math is correct.
So this is not a floating point number representation problem, it is a you are doing math over numbers that are not how you expected them to be, but what is happening makes sense, albeit, it might not be what you are wanting to do.
aka here is what is happening:
select column1, column2, div0(column1,column2)
from values
( 0, 18 ),
( 0.001, 2 ),
( -0.001, 15 );
COLUMN1 | COLUMN2 | DIV0(COLUMN1,COLUMN2) |
---|---|---|
0 | 18 | 0 |
0.001 | 2 | 0.0005 |
-0.001 | 15 | -0.000066666 |
Upvotes: 1