Reputation: 2401
I have a table with the following data (paypal transactions):
txn_type | date | subscription_id
----------------+----------------------------+---------------------
subscr_signup | 2014-01-01 07:53:20 | S-XXX01
subscr_signup | 2014-01-05 10:37:26 | S-XXX02
subscr_signup | 2014-01-08 08:54:00 | S-XXX03
subscr_eot | 2014-03-01 08:53:57 | S-XXX01
subscr_eot | 2014-03-05 08:58:02 | S-XXX02
I want to get the average subscription length overall for a given time period (subscr_eot
is the end of a subscription). In the case of a subscription that is still ongoing ('S-XXX03'
) I want it to be included from it's start date until now in the average.
How would I go about doing this with an SQL statement in Postgres?
Upvotes: 1
Views: 119
Reputation: 656291
Using the window function lag()
, this becomes considerably shorter:
SELECT avg(ts_end - ts) AS avg_subscr
FROM (
SELECT txn_type, ts, lag(ts, 1, localtimestamp)
OVER (PARTITION BY subscription_id ORDER BY txn_type) AS ts_end
FROM t
) sub
WHERE txn_type = 'subscr_signup';
lag()
conveniently takes a default value for missing rows. Exactly what we need here, so we don't need COALESCE
in addition.
The query builds on the fact that subscr_eot
sorts before subscr_signup
.
Probably faster than presented alternatives so far because it only needs a single sequential scan - even though the window functions add some cost.
Using the column ts
instead of date
for three reasons:
timestamp
.Using localtimestamp
instead of now()
or current_timestamp
since you are obviously operating with timestamp [without time zone]
.
Also, your columns txn_type
and subscription_id
should not be text
Maybe an enum
for txn_type
and integer
for subscription_id
. That would make table and indexes considerably smaller and faster.
For the query at hand, the whole table has to be read an indexes won't help - except for a covering index
in Postgres 9.2+, if you need the read performance:
CREATE INDEX t_foo_idx ON t (subscription_id, txn_type, ts);
Upvotes: 0
Reputation: 125204
SQL Fiddle. Subscription length for each subscription:
select
subscription_id,
coalesce(t2.date, current_timestamp) - t1.date as subscription_length
from
(
select *
from t
where txn_type = 'subscr_signup'
) t1
left join
(
select *
from t
where txn_type = 'subscr_eot'
) t2 using (subscription_id)
order by t1.subscription_id
The average:
select
avg(coalesce(t2.date, current_timestamp) - t1.date) as subscription_length_avg
from
(
select *
from t
where txn_type = 'subscr_signup'
) t1
left join
(
select *
from t
where txn_type = 'subscr_eot'
) t2 using (subscription_id)
Upvotes: 3
Reputation: 95512
I used a couple of common table expressions; you can take the pieces apart pretty easily to see what they do.
One of the reasons this SQL is complicated is because you're storing column names as data. (subscr_signup and subscr_eot are actually column names, not data.) This is a SQL anti-pattern; expect it to cause you much pain.
with subscription_dates as (
select
p1.subscription_id,
p1.date as subscr_start,
coalesce((select min(p2.date)
from paypal_transactions p2
where p2.subscription_id = p1.subscription_id
and p2.txn_type = 'subscr_eot'
and p2.date > p1.date), current_date) as subscr_end
from paypal_transactions p1
where txn_type = 'subscr_signup'
), subscription_days as (
select subscription_id, subscr_start, subscr_end, (subscr_end - subscr_start) + 1 as subscr_days
from subscription_dates
)
select avg(subscr_days) as avg_days
from subscription_days
-- add your date range here.
avg_days
--
75.6666666666666667
I didn't add your date range as a WHERE clause, because it's not clear to me what you mean by "a given time period".
Upvotes: 1