shanemgrey
shanemgrey

Reputation: 2378

Select value from lower rank when null

I'm trying to measure net change from the first recording of a survey against the latest recording of the same question on that same survey, given later surveys are usually incomplete and thus have nulls.

Survey answers for the same participant:

┌─────────────────────────────────────┐
| p_id | rank | val_a | val_b | val_c |
|    2 |    1 |     1 |     2 |     3 |
|    2 |    2 |     2 |       |       |
|    2 |    3 |     4 |     4 |     1 |
|    2 |    4 |     4 |     3 |       |
└─────────────────────────────────────┘

Desired output:
┌──────────────────────────────┐
| p_id | val_a | val_b | val_c |
|    2 | 3     | 1     |    -2 |
└──────────────────────────────┘

a = row4 - row1
b = row4 - row1 
c = row3 - row1 (uses the rank3 value since rank4 has none)

The result should show the difference between the value of the column of the highest ranked row which is not null, and the value of the that same column in the first row which should never be null.

So far I have the code for the difference between two rows, but can't figure out how to account for null values when there is a not null value in a lower rank that could be used.

SELECT
    p_id,
    CASE WHEN ("p_val_a" IS NOT null) AND (rank != 1) AND ("val_a" IS NOT null) THEN "val_a" - "p_val_a" ELSE NULL END as "diff_val_a",
    CASE WHEN ("p_val_b" IS NOT null) AND (rank != 1) AND ("val_b" IS NOT null) THEN "val_b" - "p_val_b" ELSE NULL END AS "diff_val_b",
    CASE WHEN ("p_val_c" IS NOT null) AND (rank != 1) AND ("val_c" IS NOT null) THEN "val_c" - "p_val_c" ELSE NULL END AS "diff_val_c"
FROM
    (
        SELECT
            p_id,
            "val_a",
            "val_b",
            "val_c",
            LAG("val_a") OVER w AS "p_val_a",
            LAG("val_b") OVER w as "p_val_b",
            LAG("val_c") OVER w as "p_val_c"
        FROM
            dataset WINDOW w AS (
                PARTITION BY 
                    p_id
                ORDER BY
                    rank
            )
    ) t;

In the example above if I query only the first and last rows, val_a and val_b produce correct results. But val_c would produce null instead of -2 as I want.

How can I compare the first row values to the value of the same column from the row with the latest not null value?

Upvotes: 1

Views: 59

Answers (2)

Gordon Linoff
Gordon Linoff

Reputation: 1270401

I would use first_value() and last_value():

select distinct p_id,
       (first_value(val_a) over (partition by p_id order by (val_a is not null)::int desc, rank desc) -
        first_value(val_a) over (partition by p_id order by (val_a is not null)::int desc, rank asc)
       ) as a_diff,
       (first_value(val_b) over (partition by p_id order by (val_b is not null)::int desc, rank desc) -
        first_value(val_b) over (partition by p_id order by (val_b is not null)::int desc, rank asc)
       ) as b_diff,
       (first_value(val_c) over (partition by p_id order by (val_c is not null)::int desc, rank desc) -
        first_value(val_c) over (partition by p_id order by (val_c is not null)::int desc, rank asc)
       ) as c_diff
from t;

Here is a SQL Fiddle.

Upvotes: 1

MatBailie
MatBailie

Reputation: 86765

Analytic functions can be used to find out which row (rank) has the first occurrence of a non-NULL value, and the same again to find the final occurrence.

Then conditional aggregation can pick out those values.

http://sqlfiddle.com/#!17/78886/9

WITH
  analysed
AS
(
  SELECT
    *,
    MIN(CASE WHEN val_a IS NOT NULL THEN rank END) OVER ranked_pid   AS first_a_pos,
    MIN(CASE WHEN val_b IS NOT NULL THEN rank END) OVER ranked_pid   AS first_b_pos,
    MIN(CASE WHEN val_c IS NOT NULL THEN rank END) OVER ranked_pid   AS first_c_pos,
    MAX(CASE WHEN val_a IS NOT NULL THEN rank END) OVER ranked_pid   AS final_a_pos,
    MAX(CASE WHEN val_b IS NOT NULL THEN rank END) OVER ranked_pid   AS final_b_pos,
    MAX(CASE WHEN val_c IS NOT NULL THEN rank END) OVER ranked_pid   AS final_c_pos
  FROM
    test
  WINDOW
    ranked_pid AS (
      PARTITION BY p_id
  --      ORDER BY rank
  --  ROWS BETWEEN unbounded preceding
  --           AND unbounded following
    )
)
SELECT
  p_id,
  MAX(CASE WHEN rank = final_a_pos THEN val_a END) - MAX(CASE WHEN rank = first_a_pos THEN val_a END)  AS change_in_a,
  MAX(CASE WHEN rank = final_b_pos THEN val_b END) - MAX(CASE WHEN rank = first_b_pos THEN val_b END)  AS change_in_b,
  MAX(CASE WHEN rank = final_c_pos THEN val_c END) - MAX(CASE WHEN rank = first_c_pos THEN val_c END)  AS change_in_c
FROM
  analysed
GROUP BY
  p_id
ORDER BY
  p_id

EDIT:

Commented out part of the window definition that's not needed. It was there while I was playing with FIRST_VALUE() and LAST_VALUE() (but postgreSQL doesn't support IGNORE NULLS)

Upvotes: 1

Related Questions