Standard deviation of a set of dates

I have a table of transactions with columns id | client_id | datetime and I have calculated the mean of days between transactions to know how often this transactions are made by each client:

SELECT *, ((date_last_transaction - date_first_transaction)/total_transactions) AS frequency 
FROM (
    SELECT client_id, COUNT(id) AS total_transactions, MIN(datetime) AS date_first_transaction, MAX(datetime) AS date_last_transaction
    FROM transactions
    GROUP BY client_id
) AS t;

What would be the existing methods to calculate the standard deviation (in days) in a set of dates with postgresql? Preferably with only one query, if it is posible :-)

Upvotes: 2

Views: 2749

Answers (1)

I have found this way:

SELECT extract(day from date_trunc('day', (
        CASE WHEN COUNT(*) <= 1 THEN 
            0 
        ELSE 
            SUM(time_since_last_invoice)/(COUNT(*)-1) 
        END
    ) * '1 day'::interval)) AS days_between_purchases, 
    extract(day from date_trunc('day', (
        CASE WHEN COUNT(*) <= 2 THEN 
            0 
        ELSE 
            STDDEV(time_since_last_invoice) 
        END
    ) * '1 day'::interval)) AS range_of_days
FROM (
    SELECT client_id, datetime, COALESCE(datetime - lag(datetime) 
              OVER (PARTITION BY client_id ORDER BY client_id, datetime
                 ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
              ), 0
           ) AS time_since_last_invoice
    FROM my_table 
    GROUP BY client_id, datetime
    ORDER BY client_id, datetime
)

Explanation: This query groups by client and date and then calculates the difference between each pair of transaction dates (datetime) by client_id and returns a table with these results. After this, the external query processes the table and calculates de average time between differences greater than 0 (first value in each group is excluded because is the first transaction and therefore the interval is 0). The standard deviation is calculated when there existe 2 o more transaction dates for the same client, to avoid division by zero errors. All differences are returned in PostgreSQL interval format.

Upvotes: 1

Related Questions