Mister Mak
Mister Mak

Reputation: 294

Grouping and sorting weighted values by rank

I have data consisting of several values. Each value has a positive weight and a rank. I would like to return a sublist of these values such that values are added by increasing rank to the list until the sum of their weights is the smallest possible greater or equal to a given target, with the additional constraint that if a value is added, all of those with that same rank must be added, ie values are added to the list by ranking group.

I have written the alogorithm below, which is quite slow and not so easy to read. The input is a list of tuples (we can assume it is already sorted by rank); each tuple contains rank, value and weight. All that needs to be returned is the list of kept values, and their total weight. The algorithm groups the data by rank, then figures out how many ranks to add to reach the proper cumulative weight. Any suggestions to accelerate and improve?

For the input list [(1, 'a', 0.10131488885239891), (1, 'b', 0.05189220940446056), (1, 'c', 0.05233086643211586), (4, 'd', 0.019651238401109394), (5, 'e', 0.11439063810128784), (6, 'f', 0.003242047085661822), (6, 'g', 0.03093925229983648), (8, 'h', 0.09737202015393008), (8, 'i', 0.02024957487257707), (10, 'j', 0.08454098571100034), (11, 'k', 0.18524611312188527), (11, 'l', 0.07110968996322396), (11, 'm', 0.023236464232769254), (14, 'n', 0.14448401136774303)] and target weight 0.47, it returns the sublist ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'] and kept weight 0.49138273560337803.

weightGoal = 0.47 # Target weight
    
# Example of data; each tuple has (rank, value and weight)
individualData = [(1, 'a', 0.10131488885239891), (1, 'b', 0.05189220940446056), (1, 'c', 0.05233086643211586), (4, 'd', 0.019651238401109394), (5, 'e', 0.11439063810128784), (6, 'f', 0.003242047085661822), (6, 'g', 0.03093925229983648), (8, 'h', 0.09737202015393008), (8, 'i', 0.02024957487257707), (10, 'j', 0.08454098571100034), (11, 'k', 0.18524611312188527), (11, 'l', 0.07110968996322396), (11, 'm', 0.023236464232769254), (14, 'n', 0.14448401136774303)]
print("Input", individualData, weightGoal)

# Determine cumulative weight for each value.
vCumulWeight = [0]
for ind in individualData: vCumulWeight.append(vCumulWeight[-1] + ind[-1])
vCumulWeight.pop(0)

# To group by rank    
ranks = list(set([v[0] for v in individualData])) # Unique values of the rank
ranks.sort(reverse=False) # Sort the list by rank
rankWeights = [] # Total weight of values with same rank
rankValues = [] # Values with same rank

# Group points by rank
for r in ranks:
    rankW, rankV = 0, []
    for ind in individualData:
        if r == ind[0]:
            rankW += ind[-1]
            rankV.append(ind[1])
    rankWeights.append(rankW)
    rankValues.append(rankV)
    
# Get cumulative weight of each group
rankCumulWeight = [0]
for op in rankWeights: rankCumulWeight.append(rankCumulWeight[-1] + op)
rankCumulWeight.pop(0)

# Rank all values and all groups
fullRankingVals, fullRankingGroups, rnk, prevLen, prevCP = [], [], 0, 1, 0
for i, (o, pr, ov, cp) in enumerate(zip(ranks, rankWeights, rankValues, rankCumulWeight)):
    fullRankingGroups.append((i+1, ov, o, pr, cp, ((cp <= weightGoal) or (prevCP < weightGoal))))
    rnk += prevLen
    for v in ov: fullRankingVals.append((rnk, v, o, pr/len(ov), cp, ((cp <= weightGoal) or (prevCP < weightGoal))))
    prevLen = len(ov)
    prevCP = cp

print("============ Check by value")
for v in fullRankingVals: print(v)
print("============ Check by group")
for v in fullRankingGroups: print(v)
print("============")

## Actual acceptance
tmpAccepted = [v for v in fullRankingVals if (v[-1])]
acceptedValues = [v[1] for v in tmpAccepted]
accdeptedWeight = max(v[-2] for v in tmpAccepted)
print("Result", acceptedValues, accdeptedWeight)

Upvotes: 1

Views: 481

Answers (2)

Nk03
Nk03

Reputation: 14949

With Pandas:

import pandas as pd

df = pd.DataFrame(l, columns=['rank', 'value', 'weight'])
df = df.groupby('rank').agg({'value': list, 'weight': sum})

m = ~((df.weight.cumsum() > 0.47).shift(
    fill_value=False))

result = df[m]['value'].explode().to_list()
weight = df[m]['weight'].cumsum().iloc[-1]

Upvotes: 1

Meysam
Meysam

Reputation: 718

Might I suggest that you take the easiest approach possible, and just solve the problem in small chunks?

Let's break it down into 3 small steps so that you get to follow along what is actually happening:

  1. Sort the data in ascending order using rank as a key.
  2. Add data to our storage until we reach the limit weightGoal
  3. If the weightGoal is reached, add to our storage if the value is the same as the last one we added to our storage

Now let's code the above flow:

weightGoal = 0.47 # Target weight
    
# Example of data; each tuple has (rank, value and weight)
individualData = [(1, 'a', 0.10131488885239891), (1, 'b', 0.05189220940446056), (1, 'c', 0.05233086643211586), (4, 'd', 0.019651238401109394), (5, 'e', 0.11439063810128784), (6, 'f', 0.003242047085661822), (6, 'g', 0.03093925229983648), (8, 'h', 0.09737202015393008), (8, 'i', 0.02024957487257707), (10, 'j', 0.08454098571100034), (11, 'k', 0.18524611312188527), (11, 'l', 0.07110968996322396), (11, 'm', 0.023236464232769254), (14, 'n', 0.14448401136774303)]

# step 1
individualData.sort(key=lambda x: x[0])

storage = 0
last = None
for data in individualData:
    if storage < weightGoal:  # step 2
        last = data
        storage += data[2]
    elif last and data[1] == last[1]:  # step 3
        storage += data[2]
        # no need to change `last` here unless you really want to
    else:  # termination clause
        break

Now the final summation is stored in the storage variable and you also get the bonus of knowing which element was the last to add using the last variable.

Let me know if there's any ambiguity.

Upvotes: 1

Related Questions