Reputation: 1084

Count distinct values with OVER(PARTITION BY id)

Is it possible to count distinct values in conjunction with window functions like OVER(PARTITION BY id)? Currently my query is as follows:

SELECT congestion.date, congestion.week_nb, congestion.id_congestion,
   congestion.id_element,
ROW_NUMBER() OVER(
    PARTITION BY congestion.id_element
    ORDER BY congestion.date),
COUNT(DISTINCT congestion.week_nb) OVER(
    PARTITION BY congestion.id_element
) AS week_count
FROM congestion
WHERE congestion.date >= '2014.01.01'
AND congestion.date <= '2014.12.31'
ORDER BY id_element, date

However, when I try to execute the query I get the following error:

"COUNT(DISTINCT": "DISTINCT is not implemented for window functions"

Upvotes: 32

Answers (6)

Arsene Xie

Reputation: 24

Yes, "DISTINCT is not implemented for window functions".

Another solution is to do DENSE_RANK() OVER(PARTITION BY ORDER BY) and MAX() OVER(PARTITION BY) instead.

e.g: for the purpose COUNT(DISTINCT week_nb) OVER(PARTITION BY id_element) week_count:

SELECT  s.date, 
        s.week_nb, 
        s.id_congestion, 
        s.id_element,
        MAX(s.week_dense_rank) OVER(PARTITION BY s.id_element) week_count
FROM
( 
   SELECT date,
          week_nb,
          id_congestion,
          id_element,
          DENSE_RANK() OVER(PARTITION BY id_element ORDER BY week_nb) week_dense_rank
   FROM   congestion
   WHERE  date >= '2014.01.01'
   AND    date <= '2014.12.31' 
) s

Upvotes: 0

NiFiNiTe

Reputation: 69

If you are counting distinct numbers, you can use other aggregate functions to acheive the same effect, like so.

select
    initial.id,
    initial.val,
    joined.id,
    array_length(uniq(sort(array_agg(joined.some_number) over (partition by initial.id))), 1) as distinct_count
from
    (values (1,'a'), (2,'b'), (3,'c')) initial(id, val)
        left join (values (1, 1),
                          (1, 1),
                          (1, 3),
                          (2, 2),
                          (2, 2),
                          (3, 3),
                          (3, 3),
                          (3, 3),
                          (3, 4)) joined(id, some_number) on joined.id = initial.id
;

id  val id  distinct_count
1   a   1   2
1   a   1   2
1   a   1   2
2   b   2   1
2   b   2   1
3   c   3   2
3   c   3   2
3   c   3   2
3   c   3   2

If you get an error message saying uniq doesn't exist, it's in the intarray extension. Try running CREATE EXTENSION intarray first and then try again. If you are still having trouble, search the Postgresql docs for info on the intarray extension. "[The] module is considered 'trusted', that is, it can be installed by non-superusers who have CREATE privilege on the current database."

Upvotes: 1

Gordon Linoff

Reputation: 1269443

I find that the easiest way is to use a subquery/CTE and conditional aggregation:

SELECT
  c.date,
  c.week_nb,
  c.id_congestion,
  c.id_element,
  ROW_NUMBER() OVER (PARTITION BY c.id_element ORDER BY c.date),
  (
    CASE WHEN seqnum = 1 THEN
      1
    ELSE
      0
    END) AS week_count
FROM (
  SELECT
    c.*,
    ROW_NUMBER() OVER (PARTITION BY c.congestion.id_element, c.week_nb ORDER BY c.date) AS seqnum
  FROM
    congestion c) c
WHERE
  c.date >= '2014.01.01'
  AND c.date <= '2014.12.31'
ORDER BY
  id_element,
  date

Upvotes: 2

Vincent

Reputation: 8796

Since this is the first result that pops up from Google, I'll add this reproducible example, similar to Gordon's answer:

Let's first start with creating a sample table:

WITH test as 
(
SELECT * 
FROM (VALUES
(1, 'A'),
(1, 'A'),
(2, 'B'),
(2, 'B'),
(2, 'D'),
(3, 'C'),
(3, 'C'),
(3, 'C'),
(3, 'E'),
(3, 'F')) AS t (id_element, week_nb)
)

select * from test

This yields:

id_element week_nb
1   A
1   A
2   B
2   B
2   D
3   C
3   C
3   C
3   E
3   F

Then, doing something like:

select 
  id_element,
  week_nb,
  sum(first_row_in_sequence) over (partition by id_element) as distinct_week_nb_count
from 
(
select 
  id_element,
  week_nb,
  case when row_number() over (partition by id_element, week_nb) = 1 then 1 else 0 end as first_row_in_sequence
from test
) as sub

yields

id_element week_nb distinct_week_nb_count
1   A   1
1   A   1
2   B   2
2   B   2
2   D   2
3   C   3
3   C   3
3   C   3
3   E   3
3   F   3

Upvotes: 2

Agnius Vasiliauskas

Reputation: 11267

Make partitioned set smaller, up to the point there is no duplicates over counted field :

SELECT congestion.date, congestion.week_nb, congestion.id_congestion,
   congestion.id_element,
ROW_NUMBER() OVER(
    PARTITION BY congestion.id_element
    ORDER BY congestion.date),
COUNT(congestion.week_nb) -- remove distinct 
OVER(
    PARTITION BY congestion.id_element,
                 -- add new fields which will restart counter in case duplication
                 congestion.id_congestion
) AS week_count
FROM congestion
WHERE congestion.date >= '2014.01.01'
AND congestion.date <= '2014.12.31'
ORDER BY id_element, date

Upvotes: 1

Simo Kivistö

Reputation: 4453

No, as the error message states, DISTINCT is not implemented with windows functions. Aplying info from this link into your case you could use something like:

WITH uniques AS (
 SELECT congestion.id_element, COUNT(DISTINCT congestion.week_nb) AS unique_references
 FROM congestion
WHERE congestion.date >= '2014.01.01'
AND congestion.date <= '2014.12.31'
 GROUP BY congestion.id_element
)

SELECT congestion.date, congestion.week_nb, congestion.id_congestion,
   congestion.id_element,
ROW_NUMBER() OVER(
    PARTITION BY congestion.id_element
    ORDER BY congestion.date),
uniques.unique_references AS week_count
FROM congestion
JOIN uniques USING (id_element)
WHERE congestion.date >= '2014.01.01'
AND congestion.date <= '2014.12.31'
ORDER BY id_element, date

Depending on the situation you could also put a subquery straight into SELECT-list:

SELECT congestion.date, congestion.week_nb, congestion.id_congestion,
   congestion.id_element,
ROW_NUMBER() OVER(
    PARTITION BY congestion.id_element
    ORDER BY congestion.date),
(SELECT COUNT(DISTINCT dist_con.week_nb)
    FROM congestion AS dist_con
    WHERE dist_con.date >= '2014.01.01'
    AND dist_con.date <= '2014.12.31'
    AND dist_con.id_element = congestion.id_element) AS week_count
FROM congestion
WHERE congestion.date >= '2014.01.01'
AND congestion.date <= '2014.12.31'
ORDER BY id_element, date

Upvotes: 10

Count distinct values with OVER(PARTITION BY id)

Answers (6)

Related Questions