GRoutar

Reputation: 1425

How to choose the weight for a weighted average?

I'm conducting a feature extraction process for a machine learning problem, and I came across an issue.

Consider a set of products. Each product is rated as either 0 or 1, which maps to bad or good, respectively. Now I want to compute, for each unique product, a rating score in the [0, n] interval, where n is an integer number greater than 0.

The total number of ratings differs from product to product, so a simple average leads to issues such as:

avg_ratio_score = good_rates / total_rates
a) 1/1 = 1
b) 95/100 = 0.95

Even though ratio a) is higher, ratio b) inspires much more confidence in a user. For this reason, I need a weighted average.
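As a minimal sketch of the problem (the function and variable names are just illustrative), the simple average ranks the single-rating product above the well-reviewed one:

    # A simple average treats a single 1/1 rating as better than 95/100,
    # even though the latter deserves far more confidence.
    def avg_ratio_score(good_rates, total_rates):
        """Plain fraction of good ratings, ignoring sample size."""
        return good_rates / total_rates

    print(avg_ratio_score(1, 1))     # a) 1.0
    print(avg_ratio_score(95, 100))  # b) 0.95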

The problem is what weight to choose. The number of ratings per product varies from around 100 to 100k.

My first approach was the following:

ratings frequency interval    weight
--------------------------    ------
90k - 100k                      20
80k - 90k                       18
70k - 80k                       16
60k - 70k                       14
50k - 60k                       12
40k - 50k                       11
30k - 40k                       10
20k - 30k                        8
10k - 20k                        6
5k - 10k                         4
1k - 5k                          3
500 - 1k                         2
100 - 500                        1
1 - 100                        0.5

weighted_rating_score = good_ratings * weight / total_ratings

At first this sounded like a good solution, but a realistic example shows it may not be:

 a. (90/100) * 0.5 = 0.9 * 0.5 = 0.45
 b. (50k/100k) * 20 = 0.5 * 20 = 10

This result suggests that product b) is a much better alternative than product a), but looking at the original ratios, that is probably not the case.
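For reference, here is a rough sketch of that bucketed-weight scheme (the helper names are mine, and since the table's boundaries overlap, boundary handling follows the worked example above, where 100 total ratings maps to weight 0.5). It reproduces the distortion described:

    # Hypothetical sketch of the bucketed-weight scheme from the table above.
    # Bucket lower bounds and weights mirror the table.
    WEIGHT_BUCKETS = [
        (90_000, 20), (80_000, 18), (70_000, 16), (60_000, 14), (50_000, 12),
        (40_000, 11), (30_000, 10), (20_000, 8), (10_000, 6), (5_000, 4),
        (1_000, 3), (500, 2), (100, 1),
    ]

    def bucket_weight(total_ratings):
        """Return the weight of the highest bucket the rating count falls into."""
        for lower_bound, weight in WEIGHT_BUCKETS:
            if total_ratings > lower_bound:
                return weight
        return 0.5  # 1 - 100 ratings

    def weighted_rating_score(good_ratings, total_ratings):
        return good_ratings / total_ratings * bucket_weight(total_ratings)

    print(weighted_rating_score(90, 100))          # 0.9 * 0.5 = 0.45
    print(weighted_rating_score(50_000, 100_000))  # 0.5 * 20  = 10.0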

I would like to know an effective way (if there is one) to calculate an appropriate weight, or any similar suggestions.

Upvotes: 0

Views: 3843

Answers (1)

Ofer Litver

Reputation: 11

I believe the answer to your question is subjective, since how much importance you attach to the uncertainty caused by the smaller number of samples is also subjective.

However, thinking in terms of a "penalty" for a lower number of samples, here is another way to correct the rating. Consider the following formula:

(GoodRates / TotalRates) - alpha * (1 / TotalRates)


This formula makes the score approach the simple ratio as TotalRates approaches infinity. In practice, the penalty term becomes negligible once TotalRates is on the order of hundreds or more. Choosing different values of alpha increases or decreases how strongly a low number of total rates is penalized.
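A rough sketch of that correction (the function name and the default alpha are my own illustration, not prescribed values):

    # Simple ratio minus a penalty that vanishes as the sample grows.
    def penalized_score(good_rates, total_rates, alpha=1.0):
        """alpha controls how strongly small samples are penalized."""
        return good_rates / total_rates - alpha * (1.0 / total_rates)

    print(penalized_score(1, 1))             # 1.0  - 1.0     = 0.0
    print(penalized_score(95, 100))          # 0.95 - 0.01    = 0.94
    print(penalized_score(50_000, 100_000))  # 0.5  - 0.00001 = 0.49999

With alpha = 1, the single 1/1 rating is pushed down to 0 while the 95/100 product barely moves, which matches the intuition in the question.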

Of course, you can always consider more complex rating approaches that capture other properties of your data, such as a larger penalty for a higher rating with the same number of observations, and so on.

Upvotes: 1
