AlexZ
AlexZ

Reputation: 254

Python - Find top largest numbers in array

Okay - I have a dilemma. So far my script converts page titles into categories. This is based on keywords, and when there is a match a certain score is added, I.e some words hold the value of 10, some only 1. This gets accumulated into a total score for each category.

[{15: [32, 'massages']}, {45: [12, 'hair-salon']}, {23,:[3, 'automotive service']}]

Index being the category id, first value the score second value the category.

In some instances this spans to over 10 category matches.

How can I filter this to only the top 60-75%

I.e clearly massages and hair salon are the most as they are well above automotive service. But how can this intelligence we use be programmed?

I was thinking stddev could help?

Edit

I am trying to filter out low scoring items e.g.

data = [{15: [32, 'massages']}, {45: [1, 'hair-salon']}, {23:[1, 'automotive service']}]]

Massages is the only high scoring item in this instance

data = [{15: [4, 'massages']}, {45: [2, 'hair-salon']}, {23:[1, 'automotive service']}]]

Stil massages

data = [{15: [10, 'massages']}, {45: [50, 'hair-salon']}, {23:[5, 'automotive service']}]]

Now hair-salon (as it is well above others)

So I need not take the first (N) objects, moreso, the first objects that are x higher then other numbers as a percentage or form of standard deviation.

So 50 is much higher then 10 and 5

10 is much higher then 3 or 2

However 9, 8 and 6 are much the same

Upvotes: 1

Views: 2729

Answers (2)

srgerg
srgerg

Reputation: 19329

Here's a solution using heapq.nlargest()

import heapq

data = [{15: [32, 'massages']}, {45: [12, 'hair-salon']}, {23:[3, 'automotive service']}]

N = int(len(data) * 0.6 + 1)
print heapq.nlargest(N, data, key = lambda x: next(x.itervalues())[0])

This prints:

[{15: [32, 'massages']}, {45: [12, 'hair-salon']}]

Edit: If you want to eliminate "low scoring items" then you need to define exactly what you mean by "low scoring".

Here's some code which takes an entirely arbitrary definition of "low scoring": a score is low if it is more than one standard deviation below the maximum:

import math

data = [{15: [32, 'massages']}, {45: [1, 'hair-salon']}, {23:[3, 'automotive service']}]

scores = [score for d in data for cat,(score,name) in d.iteritems()]
score_mean = sum(scores) / float(len(scores))
score_stdev = math.sqrt(sum(abs(s - score_mean)**2 for s in scores) / float(len(scores)))

print [d for d in data if next(d.itervalues())[0] > (max(scores) - score_stdev)]

This prints:

[{15: [32, 'massages']}]

Upvotes: 6

Hugh Bothwell
Hugh Bothwell

Reputation: 56654

yourdata = [{15: [32, 'massages']}, {45: [12, 'hair-salon']}, {23:[3, 'automotive service']}]

# transfer your data into a more usable format
data = [(score,cat,name) for dat in yourdata for cat,(score,name) in dat.iteritems()]

# sort on descending score
data.sort(reverse=True)

# throw away the low-scoring items
data = data[:int(len(data)*0.6 + 1)]

returns

[(32, 15, 'massages'), (12, 45, 'hair-salon')]

(the two highest-scoring items)

Upvotes: 2

Related Questions