tmul
tmul

Reputation: 23

Count distinct based on another column

I have the following table:

CREATE TABLE tbl (
  id          int NOT NULL
, date        date NOT NULL
, cid         int NOT NULL
, birth_place text NOT NULL
, location    text NOT NULL
);
 
INSERT INTO tbl VALUES
  (1 , '2022-01-01', 1, 'France' , 'Germany')
, (2 , '2022-01-30', 1, 'France' , 'France')
, (3 , '2022-01-25', 2, 'Spain'  , 'Spain')
, (4 , '2022-01-12', 3, 'France' , 'France')
, (5 , '2022-02-01', 4, 'England', 'Italy')
, (6 , '2022-02-12', 1, 'France' , 'France')
, (7 , '2022-03-05', 5, 'Spain'  , 'England')
, (8 , '2022-03-08', 2, 'Spain'  , 'Spain')
, (9 , '2022-03-15', 2, 'Spain'  , 'Spain')
, (10, '2022-03-30', 5, 'Spain'  , 'Italy')
, (11, '2022-03-22', 4, 'England', 'England')
, (12, '2022-03-22', 3, 'France' , 'England');

I need to count distinct customers (= cid) per month and location - with a special twist:
If a customer they went back to their birth place (location = birth_place) during any given month, give priority to that location. Else pick one location per month and customer arbitrarily.

My desired output:

date         location   count
2022-01-01   France     2
2022-01-01   Spain      1
2022-02-01   Italy      1
2022-02-01   France     1
2022-03-01   Spain      1
2022-03-01   England    3

cid 1 in 2022-01-01 had a location = birth_place, and no other customer had Germany as location in that time period, hence there is no Germany in my desired output location.

This is my current query:

with
  t as (
    select id, date_trunc('month', date)::date AS date, cid, birth_place, location
    from tbl),
  t1 as (
    select date, cid, location
    from t
    where birth_place = location),
  t2 as (
    select date, cid, location, row_number() over (partition by date, cid order by date) as row
    from t
    where birth_place <> location),
  t3 as (
    select t.*, 
        case
            when t1.location is not null then t1.location
            else t2.location
        end as new_loc
    from t
    left join t1
    on t.cid = t1.cid and t.date = t1.date
    left join t2
    on t.cid = t2.cid and t.date = t2.date and t2.row = 1)
select date, new_loc, count(distinct cid)
from t3
group by 1, 2
order by 1, 2

It works, but it seems inefficient for 100 million rows.
I am looking for a more efficient way.

Upvotes: 1

Views: 1867

Answers (3)

Erwin Brandstetter
Erwin Brandstetter

Reputation: 656596

Assuming this objective:

Truncate dates to the month.
Pick one location per (month, cid), the home location getting priority.
Then count rows per (month, location).

SELECT date, location, count(*)
FROM  (
   SELECT DISTINCT ON (1, 2)  --  choose **one** location per (month, cid)
          date_trunc('month', date)::date AS date, cid, location
   FROM   tbl
   ORDER  BY 1, 2, birth_place = location DESC  -- priority to home location, else **arbitrary**
   ) sub
GROUP  BY 1, 2
ORDER  BY 1, 2;  -- optional

db<>fiddle here

Be aware that the arbitrary pick in case of "not been home" can give varying results! You may want to define a stable (opportune) pick instead.

There may be faster query variants, depending on undisclosed details.

About DISTINCT ON and performance:

About the sort order:

If there can be NULL values:

Upvotes: 1

robru
robru

Reputation: 1

My approach would be using case...when within a count. That way it works with or without a where filter and therefore allows other aggregates of the data in the same query in the future.

SELECT
    date_trunc('month', date)::date AS date, t.location
    , count(distinct (case when t.location=t.birth_place then t.cid else null end)) as "count"
FROM theTable AS t
WHERE t.location=t.birth_place
GROUP BY date_trunc('month', date)::date, t.location

Upvotes: 0

Dai
Dai

Reputation: 155085

WITH q1 AS (

    SELECT
        EXTRACT( YEAR  FROM t."date" ) AS "Year",
        EXTRACT( MONTH FROM t."date" ) AS "Month",

        t.cid,
        t.birth_place,
        t.location
    FROM
        theTable AS t
    WHERE
        t.location = t.birth_place
)
SELECT
    "Year",
    "Month",
    "location",
    COUNT( DISTINCT cId ) AS "COUNT( DISTINCT cId )",
    COUNT( * ) AS "CountAll"
FROM
    q1
GROUP BY
    "Year",
    "Month",
    "location"
ORDER BY
    "Year",
    "Month",
    "location"

SqlFiddle example.

enter image description here

Upvotes: 1

Related Questions