Reputation: 143
I am working on a project that needs to decide which ID is the most likely one. Let me explain with an example. I have 3 dictionaries that contain IDs and their scores.
Ex: d1 = {74701: 3, 90883: 2}
I assign a percentage score like this:
d1_p = {74701: 60.0, 90883: 40.0}, where each score is (value of the key in d1) / (sum of all values in d1) * 100
Similarly, I have 2 other dictionaries:
d2 = {90883: 2, 74701: 2} , d2_p = {90883.0: 50.0, 74701.0: 50.0}
d3 = {75853: 2}, d3_p = {75853: 100.0}
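A minimal sketch of that percentage calculation, assuming every dictionary has a positive total; `to_percentages` is a hypothetical helper name used only for illustration:

```python
def to_percentages(scores):
    """Convert raw scores to percentages of their total (hypothetical helper)."""
    total = sum(scores.values())  # assumed to be greater than zero
    return {key: 100 * value / total for key, value in scores.items()}


d1 = {74701: 3, 90883: 2}
d3 = {75853: 2}

d1_p = to_percentages(d1)
d3_p = to_percentages(d3)
```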
My task is to give a composite score for each ID from the above 3 dictionaries and decide a winner by taking the highest score. How would I mathematically assign a composite score between 0 and 100 to each of these IDs?
Ex: in the above case, 74701 needs to be the clear winner.
I tried taking the average, but that fails because I need to give more preference to IDs that occur in multiple dictionaries. Ex: let's say 74701 was the majority in d1 and d2 with scores of 30 and 40. Then its average will be (30+40+0)/3 = 23.33, while 75853, which occurs only once with 100%, will get (100+0+0)/3 = 33.33 and will be declared the winner, which is wrong.
Hence, can someone suggest a good mathematical way, ideally with Python code, to compute such a score and decide the majority?
Upvotes: 0
Views: 78
Reputation: 796
Instead of trying to create a global score from different dictionaries, since your main goal is to analyze frequency, I would suggest summarizing all the data into a single dictionary, which is less error-prone in general. Say I have 3 dictionaries:
a = {1: 2, 2: 3}
b = {2: 4, 3: 5}
c = {3: 4, 4: 9}
You could summarize these three dictionaries into one by summing the values for each key:
result = {1: 2, 2: 7, 3: 9, 4: 9}
That could be easily achieved by using Counter:
from collections import Counter
result = Counter(a)
result.update(Counter(b))
result.update(Counter(c))
result = dict(result)
That yields the desired summary. If you want different weights for each dictionary, that can be done in a similar fashion. The takeaway is that you should not try to obtain information from the dictionaries as separate entities, but instead merge them into one statistic.
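If different weights are wanted, one possible sketch is below. The weight values are invented for illustration, `merge_weighted` is a hypothetical helper name, and dividing by the grand total is just one assumed way to scale the result to a 0-100 range:

```python
from collections import Counter


def merge_weighted(dicts, weights):
    """Sum the values per key, multiplying each dictionary by its weight."""
    merged = Counter()
    for d, w in zip(dicts, weights):
        for key, value in d.items():
            merged[key] += w * value
    return dict(merged)


a = {1: 2, 2: 3}
b = {2: 4, 3: 5}
c = {3: 4, 4: 9}

# Equal weights reproduce the plain sum shown above.
result = merge_weighted([a, b, c], [1, 1, 1])

# Illustrative weights only: here dictionary b counts double.
weighted = merge_weighted([a, b, c], [1, 2, 1])

# Scale to a 0-100 range by dividing by the grand total.
total = sum(weighted.values())
scores = {key: 100 * value / total for key, value in weighted.items()}
winner = max(scores, key=scores.get)
```

With the illustrative weights, key 3 ends up with the highest score; different weights could of course change that.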
Upvotes: 1
Reputation: 42421
Think of the data in a tabular way: for each game/match/whatever, each ID gets a certain number of points. If you care the most about the overall point total for the entire sequence of games (the entire "season", so to speak), then add up the points to determine a winner (and then scale everything down/up to the 0 to 100 range).
Game    74701   90883   75853
-----------------------------
1       3       2       0
2       2       2       0
3       0       0       2
Total   5       4       2
Alternatively, we can express those same scores in percentage terms per game. Again, every ID must be given a value. In this case, we need to average the percentages -- all of them, including the zeros:
Game    74701   90883   75853
-----------------------------
1       .6      .4      0
2       .5      .5      0
3       0       0       1.0
Avg     .37     .30     .33
Both approaches could make sense, depending on the context. And both also declare 74701 to be the winner, as desired. But notice that they give different results for 2nd and 3rd place. Such differences occur because the two systems prioritize different things. You need to decide which approach you prefer.
Either way, the first step is to organize the data better. It seems more convenient to have all scores or percentages for each ID, so you can do the needed math with them: that sounds like a dict mapping IDs to lists of scores or percentages.
# Put the data into one collection.
d1 = {74701: 3, 90883: 2}
d2 = {90883: 2, 74701: 2}
d3 = {75853: 2}
raw_scores = [d1, d2, d3]
# Find all IDs.
ids = tuple(set(i for d in raw_scores for i in d))
# Total points/scores for each ID.
points = {
    i: [d.get(i, 0) for d in raw_scores]
    for i in ids
}
# If needed, use that dict to create a similar dict for percentages. Or you
# could create a dict with the same structure holding *both* point totals and
# percentages. Just depends on the approach you pick.
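# Each game's total, so a score can be expressed as its share of that game.
game_totals = [sum(d.values()) for d in raw_scores]
pcts = {}
for i, scores in points.items():
    pcts[i] = [sc / tot for sc, tot in zip(scores, game_totals)]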
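To go from per-ID scores to a single 0-100 number and a winner, something like the following could work. It is only a sketch: `composite_scores` is a made-up helper name, every dictionary is assumed to have a positive total, and scaling point totals by the grand total is one assumed way to land in the 0-100 range.

```python
def composite_scores(raw_scores, use_percentages=False):
    """Return {id: score in 0-100}; hypothetical helper, not a library function."""
    ids = {i for d in raw_scores for i in d}
    if use_percentages:
        # Average each ID's per-game share, counting missing games as zero.
        n = len(raw_scores)
        return {
            i: 100 * sum(d.get(i, 0) / sum(d.values()) for d in raw_scores) / n
            for i in ids
        }
    # Otherwise scale each ID's point total by the grand total of all points.
    grand_total = sum(sum(d.values()) for d in raw_scores)
    return {
        i: 100 * sum(d.get(i, 0) for d in raw_scores) / grand_total
        for i in ids
    }


d1 = {74701: 3, 90883: 2}
d2 = {90883: 2, 74701: 2}
d3 = {75853: 2}
raw_scores = [d1, d2, d3]

by_points = composite_scores(raw_scores)
by_pct = composite_scores(raw_scores, use_percentages=True)

winner_points = max(by_points, key=by_points.get)
winner_pct = max(by_pct, key=by_pct.get)
```

Both variants pick 74701 for this data, and in both the scores of all IDs sum to 100. They still disagree on 2nd and 3rd place, as in the tables above.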
Upvotes: 0