TrewTzu
TrewTzu

Reputation: 1150

How to replicated SQL window rank in python

If we assume i have a input dataset (list of lists) with id, data, and a score value and I would like to filter down to the highest scoring day for each id. Normally in SQL I would do this with a window and rank function but i can't think of a Pythonic way of approaching this.

Here is a native solution:

data = [
    ["123", "11/11/11", "0.5"],
    ["555", "12/11/11", "0.3"],
    ["555", "13/11/11", "0.9"],
    ["123", "14/11/11", "0.8"]
]
_sorted = sorted( data, key=lambda record: (record[0], record[2]), reverse=True)

output = []

last_id_seen = None
for record in _sorted:
    if record[0] is last_id_seen:
        continue
    last_id_seen = record[0]
    output.append(record)
print(output)

# output
# [['555', '13/11/11', '0.9'], ['123', '14/11/11', '0.8']]

But this feels clumsy and I don't know how well the sort will support a more complex situation. Also I'd ideally like to avoid a Pandas or Numpy solution as i dont think they are needed here.

Suggestions?

Upvotes: 1

Views: 94

Answers (1)

Sıddık Açıl
Sıddık Açıl

Reputation: 967

data = [
    ["123", "11/11/11", "0.5"],
    ["555", "12/11/11", "0.3"],
    ["555", "13/11/11", "0.9"],
    ["123", "14/11/11", "0.8"]
] # data 

from itertools import groupby # groupby function
# Sort on id and score
_sorted = sorted( data, key=lambda record: (record[0], record[2]), reverse=True)

for k, v in groupby(_sorted, lambda x: x[0]): # group by id
    # k: ids, v: groups
    print(list(v)[0]) # print

I have used groupby from itertools to group sorted array on ID column. Since we have a reverse order on score key, getting the first element v[0] of each group is enough.

Upvotes: 1

Related Questions