Get top x% rows for every unique value in column by other column value

Question

Table "tags":

Source  Target      Weight
#003    blitzkrank  0.83
#003    deutsch     0.7
#003    brammen     0.57
#003    butzfrauen  0.55
#003    solaaaa     0.5
#003    moments     0.3
college scandal     1.15
college prosecutors 0.82
college students    0.41
college usc         0.33
college full house  0.17
college friends     0.08
college house       0.5
college friend      0.01

The table has 5.600.000 rows and ~91.000 unique entries in column "Source".

For every unique value in "Source" and "Target" I need the top x% rows (e.g. top 20%, top 30%, needs to be variable) by Weight (table is sorted by "Source" (ascending) and "Weight" (descending).

If rows have the same "Weight" take rows in alphabetical order.
If x% == 0 take at least one row.

Since there will be duplicates (e.g. "Source = "college" will yield at least one duplicate row as "Target" = "scandal") duplicated entries should be removed if possible. Otherwise not a big deal.

Calculation for "Source":

6 rows where Source = "#003", 6 * 0.2 = 1.2 = take 1 row
8 rows where Source = "college", 8 * 0.2 = 1.6 = take 2 rows

Needed result table for "Source":

Source  Target      Weight
#003    blitzkrank  0.83
college scandal     1.15
college prosecutors 0.82

How can I do that in SQL in an SQLite database?

Gordon Linoff · Accepted Answer

If you want a sample by source:

select t.*
from (select t.*,
             row_number() over (partition by source order by weight desc, target) as seqnum,
             count(*) over (partition by source) as cnt
      from t
     ) t
where seqnum = 1 or  -- always at least one row
      seqnum <= round(cnt * 0.2);

Based on your example, I think this is what you want. You can construct a similar query for target.

Get top x% rows for every unique value in column by other column value

Answers (1)

Related Questions