roody
roody

Reputation: 2663

Find last occurring value within record in PostgreSQL

I'm not new to SQL, but I am new to PostgreSQL and am really struggling to adapt my current knowledge in a different environment.

I am trying to create a variable that captures whether or not someone stays active, skips, or churns within a 0/1 time series variable. For example, in the data below, my dataset would include the variables id,time, and voted, and I would create the variable "skipped":

id     time     voted     skipped
1      1        1         active
1      2        0         skipped
1      3        1         active
2      1        1         active
2      2        0         churned
2      3        0         churned
3      1        1         active
3      2        1         active
3      3        0         churned

The rule for coding "skipped" is pretty simple: If 1 is the last record, the person is "active" and any zeroes count as "skipped", but if 0 is the last record, the person is "churned".

The record with id = 1 is a skip because id is non-zero at time 3 after being 0 at time 2. The other two cases, 0 is the final value so they are "churned". Can anyone help? I've been noodling on it all day, and am hitting a wall.

Upvotes: 1

Views: 84

Answers (2)

selwyth
selwyth

Reputation: 2497

Window functions can help for readability when working with self-referential joins.

WITH
  add_last_voted_status AS (
    SELECT
      *
    , LAST_VALUE(voted) OVER (
        PARTITION BY id
        ORDER BY time
      ) AS last_voted_status
    FROM table
)

SELECT
  id
, time
, voted
, CASE
    WHEN last_voted_status = 0
      THEN 'churned'
    WHEN last_voted_status = 1 AND voted = 1
      THEN 'active'
    WHEN last_voted_status = 1 AND voted = 0
      THEN 'skipped'
    ELSE '?'
  END AS skipped
FROM add_last_voted_status

Upvotes: 1

Hambone
Hambone

Reputation: 16377

This isn't particularly elegant, but it should meet your needs:

with votes as (
  select
    id, time, voted,
    max(time) over (partition by id) as max_time
  from voter_data
)
select
  v1.id, v1.time, v1.voted,
  case
    when v1.voted = 1 then 'active'
    when v2.voted = 1 then 'skipped'
    else 'churned'
  end as skipped
from
  votes v1
  join votes v2 on
    v1.id = v2.id and
    v1.max_time = v2.time

In a nutshell, we first figure out which is the last record for each voter id, and then we do a self-join on the resulting table to isolate only that last id.

There is a chance this could produce multiple results -- if it's possible to have the same ID vote twice at the same time. If that's the case, you want row_number() instead of max().

Results on your data:

1   1   1   'active'
1   2   0   'skipped'
1   3   1   'active'
2   1   1   'active'
2   2   0   'churned'
2   3   0   'churned'
3   1   1   'active'
3   2   1   'active'
3   3   0   'churned'

Upvotes: 3

Related Questions