Daniel
Daniel

Reputation: 3063

sqlite: get the average of the top X% for every item

Is it possible to get the average of the top X% items in a group?

For example:
I have a table which has a item_id, timestamp and price column. The output should be grouped by item_id and timestamp and the 'price-column' should get averaged. For the averaging only the lowest X% prices within that group should be used.

I've found similar questions (How to select top x records for every group) but this won't work with sqlite.

Upvotes: 3

Views: 2223

Answers (2)

Gordon Linoff
Gordon Linoff

Reputation: 1270411

Getting the top n records within each group requires counting. Assuming that there are no duplicates, the following query returns the number of records for an item:

select t.*,
       (select count(*) from t t2 where t2.item_id = t.item_id
       ) as NumPrices
from t

This is called a correlated subquery. Now, let's extend the idea to include a rank and then calculate the average for the right group:

select item_id, avg(price)
from (select t.*,
             (select count(*) from t t2 where t2.item_id = t.item_id
             ) as NumPrices,
             (select count(*) from t t2 where t2.item_id = t.item_id and t2.price <= t.price
             ) as PriceRank
      from t
     ) t
where (100.0*PriceRank / NumPrices) <= X
group by item_id

To improve performance, you will want an index on (item_id, price).

Upvotes: 5

CL.
CL.

Reputation: 180172

To get the count of records in the group with ID I and timestamp T, use this query:

SELECT COUNT(*)
FROM MyTable
WHERE item_id = I
  AND timestamp = T

To get the limit, multiply with X, and use ROUND/CAST to convert to an integer:

SELECT CAST(ROUND(COUNT(*) * X / 100) AS INTEGER)
FROM MyTable
WHERE item_id = I
  AND timestamp = T

To get all records in a specific group that are inside that limit, order the records in the group by price, and limit the returned count:

SELECT *
FROM MyTable
WHERE item_id = I
  AND timestamp = T
ORDER BY price
LIMIT (SELECT CAST(ROUND(COUNT(*) * X / 100) AS INTEGER)
       FROM MyTable
       WHERE item_id = I
         AND timestamp = T)

In theory, to get the group averages, add GROUP BY around that:

SELECT item_id,
       timestamp,
       (SELECT AVG(price)
        FROM (SELECT price
              FROM MyTable T2
              WHERE T2.item_id = T1.item_id
                AND T2.timestamp = T1.timestamp
              ORDER BY price
              LIMIT (SELECT CAST(ROUND(COUNT(*) * X / 100) AS INTEGER)
                     FROM MyTable T3
                     WHERE T3.item_id = T1.item_id
                       AND T3.timestamp = T1.timestamp)
             )
       ) AS AvgPriceLowestX
FROM MyTable T1
GROUP BY item_id,
         timestamp

However, it appears that SQLite does not allow access to correlation variables from inside the LIMIT clause, so this does not work in practice. You would have to get the IDs of all groups (SELECT DISTINCT item_id, timestamp FROM MyTable) and execute the third query above for each group.

In any case, ensure that you have one index on the three columns item_id, timestamp, and price to get good performance.

Upvotes: 1

Related Questions