Deleting Outliers and Compute for the Average Using Mysql

I am computing for the average lead time after outliers are removed. I have both z-scores and standard deviation to do the calculation.

I have used this sql query:

SELECT *
FROM (
    SELECT
        ROUND(AVG(DATEDIFF(shipped_date, order_date)),2) AS actual_ave_lead_time,
        DATEDIFF(shipped_date, order_date) - AVG(DATEDIFF(shipped_date, order_date))/
        STDDEV(DATEDIFF(shipped_date, order_date)) AS zscore
    FROM orders
) AS score_table
WHERE zscore BETWEEN zscore<1.96 AND >-.96;

I am expecting to get the overall average of the actual_ave_lead_time.

Upvotes: 0

Views: 94

Answers (1)

user1191247
user1191247

Reputation: 12998

To calculate your z-scores you need to run the calculation (z = (x-μ)/σ) per order. The innermost sub-query calculates the AVG() and STDDEV() for the full set of orders, which is then joined to the orders to calculate the z-scores, which can then be used to exclude the outliers from the outermost AVG() -

SELECT AVG(lead_time)
FROM (
    SELECT
        DATEDIFF(shipped_date, order_date) AS lead_time,
        (DATEDIFF(shipped_date, order_date) - avg_lead_time) / stddev_lead_time AS zscore
    FROM orders o
    JOIN (
        SELECT
            AVG(DATEDIFF(shipped_date, order_date)) AS avg_lead_time,
            STDDEV(DATEDIFF(shipped_date, order_date)) AS stddev_lead_time
        FROM orders
    ) s
    HAVING zscore BETWEEN -0.96 AND 1.96
) t;

If you are using MySQL 8 you could use the aggregate functions as window functions -

SELECT AVG(lead_time)
FROM (
    SELECT
        DATEDIFF(shipped_date, order_date) AS lead_time,
        (DATEDIFF(shipped_date, order_date) - AVG(DATEDIFF(shipped_date, order_date)) OVER()) / STDDEV(DATEDIFF(shipped_date, order_date)) OVER() AS zscore
    FROM orders
) t
WHERE zscore BETWEEN -0.96 AND 1.96;

Without example data, this is untested but I think it is correct.

Upvotes: 2

Related Questions