Room'on
Room'on

Reputation: 93

SQL how to calculate median not based on rows

I have a sample of cars in my table and I would like to calculate the median price for my sample with SQL. What is the best way to do it?

+-----+-------+----------+
| Car | Price | Quantity |
+-----+-------+----------+
| A   |   100 |        2 |
| B   |   150 |        4 |
| C   |   200 |        8 |
+-----+-------+----------+

I know that I can use percentile_cont (or percentile_disc) if my table is like this:

+-----+-------+
| Car | Price |
+-----+-------+
| A   |   100 |
| A   |   100 |
| B   |   150 |
| B   |   150 |
| B   |   150 |
| B   |   150 |
| C   |   200 |
| C   |   200 |
| C   |   200 |
| C   |   200 |
| C   |   200 |
| C   |   200 |
| C   |   200 |
| C   |   200 |
+-----+-------+

But in the real world, my first table has about 100 million rows and the second table should have about 3 billiard rows (and moreover I don't know how to transform my first table into the second).

Upvotes: 0

Views: 257

Answers (2)

George Joseph
George Joseph

Reputation: 5922

Here is a way to do this in sql server

In the first step i do is calculate the indexes corresponding to the lower and upper bounds for the median (if we have odd number of elements then the lower and upper bounds are same else its based on the x/2 and x/2+1th value)

Then i get the cumulative sum of the quantity and the use that to choose the elements corresponding to the lower and upper bounds as follows

with median_dt
  as (
select case when sum(quantity)%2=0 then
                 sum(quantity)/2 
            else 
                 sum(quantity)/2 + 1
        end as lower_limit
      ,case when sum(quantity)%2=0 then
                 (sum(quantity)/2) + 1
            else 
                 sum(quantity)/2 + 1
        end as upper_limit  
 from t 
     )
    ,data
    as (
 select *,sum(quantity) over(order by price asc) as cum_sum
   from t
       )
   ,rnk_val
   as(select * 
       from (
             select price,row_number() over(order by d.cum_sum asc) as rnk
               from data d
               join median_dt b
                 on b.lower_limit<=d.cum_sum
             )x 
      where x.rnk=1
      union all 
     select * 
       from (
             select price,row_number() over(order by d.cum_sum asc) as rnk
               from data d
               join median_dt b
                 on b.upper_limit<=d.cum_sum
             )x 
      where x.rnk=1
      ) 
  select avg(price) as median
    from rnk_val



+--------+
| median |
+--------+
|    200 |
+--------+

db fiddle link https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=c5cfa645a22aa9c135032eb28f1749f6

Upvotes: 1

Isolated
Isolated

Reputation: 6454

This looks right on few results, but try on a larger set to double-check.

First create a table which has the total for each car (or use CTE or sub-query), your choice. I'm just creating a separate table here.

    create table table2 as
    (
        select car, 
        quantity, 
        price, 
        price * quantity as total
        from table1
    )

Then run this query, which looks for the price group that falls in the middle.

    select price
    from (
        select car, price, 
        sum(total) over (order by car) as rollsum, 
        sum(total) over () as total
        from table2
        )a
    where rollsum >= total/2

Correctly returns a value of $200.

Upvotes: 1

Related Questions