Reputation: 254
Okay - I have a dilemma. So far my script converts page titles into categories. This is based on keywords, and when there is a match a certain score is added, I.e some words hold the value of 10, some only 1. This gets accumulated into a total score for each category.
[{15: [32, 'massages']}, {45: [12, 'hair-salon']}, {23,:[3, 'automotive service']}]
Index being the category id, first value the score second value the category.
In some instances this spans to over 10 category matches.
How can I filter this to only the top 60-75%
I.e clearly massages and hair salon are the most as they are well above automotive service. But how can this intelligence we use be programmed?
I was thinking stddev could help?
Edit
I am trying to filter out low scoring items e.g.
data = [{15: [32, 'massages']}, {45: [1, 'hair-salon']}, {23:[1, 'automotive service']}]]
Massages is the only high scoring item in this instance
data = [{15: [4, 'massages']}, {45: [2, 'hair-salon']}, {23:[1, 'automotive service']}]]
Stil massages
data = [{15: [10, 'massages']}, {45: [50, 'hair-salon']}, {23:[5, 'automotive service']}]]
Now hair-salon (as it is well above others)
So I need not take the first (N) objects, moreso, the first objects that are x higher then other numbers as a percentage or form of standard deviation.
So 50 is much higher then 10 and 5
10 is much higher then 3 or 2
However 9, 8 and 6 are much the same
Upvotes: 1
Views: 2729
Reputation: 19329
Here's a solution using heapq.nlargest()
import heapq
data = [{15: [32, 'massages']}, {45: [12, 'hair-salon']}, {23:[3, 'automotive service']}]
N = int(len(data) * 0.6 + 1)
print heapq.nlargest(N, data, key = lambda x: next(x.itervalues())[0])
This prints:
[{15: [32, 'massages']}, {45: [12, 'hair-salon']}]
Edit: If you want to eliminate "low scoring items" then you need to define exactly what you mean by "low scoring".
Here's some code which takes an entirely arbitrary definition of "low scoring": a score is low if it is more than one standard deviation below the maximum:
import math
data = [{15: [32, 'massages']}, {45: [1, 'hair-salon']}, {23:[3, 'automotive service']}]
scores = [score for d in data for cat,(score,name) in d.iteritems()]
score_mean = sum(scores) / float(len(scores))
score_stdev = math.sqrt(sum(abs(s - score_mean)**2 for s in scores) / float(len(scores)))
print [d for d in data if next(d.itervalues())[0] > (max(scores) - score_stdev)]
This prints:
[{15: [32, 'massages']}]
Upvotes: 6
Reputation: 56654
yourdata = [{15: [32, 'massages']}, {45: [12, 'hair-salon']}, {23:[3, 'automotive service']}]
# transfer your data into a more usable format
data = [(score,cat,name) for dat in yourdata for cat,(score,name) in dat.iteritems()]
# sort on descending score
data.sort(reverse=True)
# throw away the low-scoring items
data = data[:int(len(data)*0.6 + 1)]
returns
[(32, 15, 'massages'), (12, 45, 'hair-salon')]
(the two highest-scoring items)
Upvotes: 2