Reputation: 23
I have the following table:
CREATE TABLE tbl (
id int NOT NULL
, date date NOT NULL
, cid int NOT NULL
, birth_place text NOT NULL
, location text NOT NULL
);
INSERT INTO tbl VALUES
(1 , '2022-01-01', 1, 'France' , 'Germany')
, (2 , '2022-01-30', 1, 'France' , 'France')
, (3 , '2022-01-25', 2, 'Spain' , 'Spain')
, (4 , '2022-01-12', 3, 'France' , 'France')
, (5 , '2022-02-01', 4, 'England', 'Italy')
, (6 , '2022-02-12', 1, 'France' , 'France')
, (7 , '2022-03-05', 5, 'Spain' , 'England')
, (8 , '2022-03-08', 2, 'Spain' , 'Spain')
, (9 , '2022-03-15', 2, 'Spain' , 'Spain')
, (10, '2022-03-30', 5, 'Spain' , 'Italy')
, (11, '2022-03-22', 4, 'England', 'England')
, (12, '2022-03-22', 3, 'France' , 'England');
I need to count distinct customers (= cid
) per month and location - with a special twist:
If a customer they went back to their birth place (location = birth_place
) during any given month, give priority to that location. Else pick one location per month and customer arbitrarily.
My desired output:
date location count
2022-01-01 France 2
2022-01-01 Spain 1
2022-02-01 Italy 1
2022-02-01 France 1
2022-03-01 Spain 1
2022-03-01 England 3
cid
1 in 2022-01-01 had a location = birth_place
, and no other customer had Germany as location in that time period, hence there is no Germany in my desired output location.
This is my current query:
with
t as (
select id, date_trunc('month', date)::date AS date, cid, birth_place, location
from tbl),
t1 as (
select date, cid, location
from t
where birth_place = location),
t2 as (
select date, cid, location, row_number() over (partition by date, cid order by date) as row
from t
where birth_place <> location),
t3 as (
select t.*,
case
when t1.location is not null then t1.location
else t2.location
end as new_loc
from t
left join t1
on t.cid = t1.cid and t.date = t1.date
left join t2
on t.cid = t2.cid and t.date = t2.date and t2.row = 1)
select date, new_loc, count(distinct cid)
from t3
group by 1, 2
order by 1, 2
It works, but it seems inefficient for 100 million rows.
I am looking for a more efficient way.
Upvotes: 1
Views: 1867
Reputation: 656596
Assuming this objective:
Truncate dates to the month.
Pick one location per (month, cid), the home location getting priority.
Then count rows per (month, location).
SELECT date, location, count(*)
FROM (
SELECT DISTINCT ON (1, 2) -- choose **one** location per (month, cid)
date_trunc('month', date)::date AS date, cid, location
FROM tbl
ORDER BY 1, 2, birth_place = location DESC -- priority to home location, else **arbitrary**
) sub
GROUP BY 1, 2
ORDER BY 1, 2; -- optional
db<>fiddle here
Be aware that the arbitrary pick in case of "not been home" can give varying results! You may want to define a stable (opportune) pick instead.
There may be faster query variants, depending on undisclosed details.
About DISTINCT ON
and performance:
About the sort order:
If there can be NULL values:
Upvotes: 1
Reputation: 1
My approach would be using case...when
within a count. That way it works with or without a where
filter and therefore allows other aggregates of the data in the same query in the future.
SELECT
date_trunc('month', date)::date AS date, t.location
, count(distinct (case when t.location=t.birth_place then t.cid else null end)) as "count"
FROM theTable AS t
WHERE t.location=t.birth_place
GROUP BY date_trunc('month', date)::date, t.location
Upvotes: 0
Reputation: 155085
WITH q1 AS (
SELECT
EXTRACT( YEAR FROM t."date" ) AS "Year",
EXTRACT( MONTH FROM t."date" ) AS "Month",
t.cid,
t.birth_place,
t.location
FROM
theTable AS t
WHERE
t.location = t.birth_place
)
SELECT
"Year",
"Month",
"location",
COUNT( DISTINCT cId ) AS "COUNT( DISTINCT cId )",
COUNT( * ) AS "CountAll"
FROM
q1
GROUP BY
"Year",
"Month",
"location"
ORDER BY
"Year",
"Month",
"location"
Upvotes: 1