Reputation: 1425
I'm conducting a feature extraction process for a machine learning problem and I came across an issue.
Consider a set of products. Each product is rated as either 0 or 1, which maps to bad or good, respectively. Now I want to compute, for each unique product, a rating score in the [0, n] interval, where n is an integer greater than 0.
The total number of ratings differs from product to product, so a simple average will cause issues such as:
avg_ratio_score = good_rates / total_rates
a) 1/1 = 1
b) 95/100 = 0.95
Even though ratio a) is higher, ratio b) gives much more confidence to a user. For this reason, I need a weighted average.
The problem is which weight to choose. The products' rating counts vary from around 100 to 100k.
My first approach was the following:
ratings frequency interval    weight
--------------------------    ------
90k - 100k                    20
80k - 90k                     18
70k - 80k                     16
60k - 70k                     14
50k - 60k                     12
40k - 50k                     11
30k - 40k                     10
20k - 30k                     8
10k - 20k                     6
5k - 10k                      4
1k - 5k                       3
500 - 1k                      2
100 - 500                     1
1 - 100                       0.5
weighted_rating_score = good_ratings * weight / total_ratings
At first this sounded like a good solution, but a real example shows it might not be as good as it looks:
a. (90/100) * 0.5 = 0.9 * 0.5 = 0.45
b. (50k/100k) * 20 = 0.5 * 20 = 10
This result suggests that product b) is a much better alternative than product a), but looking at the original ratios that might not be the case.
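For reference, this is a minimal sketch of the interval-weight approach described above; the table of `(upper bound, weight)` pairs mirrors the one in the question, and the function name is just illustrative:

```python
# (upper bound of total ratings, weight) pairs, ascending,
# mirroring the frequency-interval table above.
WEIGHTS = [
    (100, 0.5), (500, 1), (1_000, 2), (5_000, 3), (10_000, 4),
    (20_000, 6), (30_000, 8), (40_000, 10), (50_000, 11),
    (60_000, 12), (70_000, 14), (80_000, 16), (90_000, 18),
    (100_000, 20),
]

def weighted_rating_score(good, total):
    """good_ratings * weight / total_ratings, with the weight
    picked from the first interval whose upper bound covers total."""
    weight = next(w for upper, w in WEIGHTS if total <= upper)
    return good / total * weight

# The problematic comparison from the examples above:
print(weighted_rating_score(90, 100))          # 0.9 * 0.5 = 0.45
print(weighted_rating_score(50_000, 100_000))  # 0.5 * 20  = 10.0
```

Running it reproduces the mismatch: the 90%-rated product scores 0.45 while the 50%-rated one scores 10.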
I would like to know an effective way (if there is one) to calculate the ideal weight, or any similar suggestions.
Upvotes: 0
Views: 3843
Reputation: 11
I believe the answer to your question is subjective, since the importance you attach to the uncertainty caused by the smaller number of samples is also subjective.
However, thinking in terms of a "penalty" for a lower number of samples, I can think of another way to correct the rating. Consider the following formula:
(GoodRates / TotalRates) - alpha * (1 / TotalRates)
This formula causes the rating to approach the simple ratio as TotalRates approaches infinity; effectively, the penalty becomes negligible once the number of rates is in the hundreds or above. Choosing different values of alpha increases or decreases the penalty for a low number of total rates.
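The penalty formula above can be sketched in a few lines (the function name and the default `alpha = 1.0` are my own choices for illustration):

```python
def penalized_score(good, total, alpha=1.0):
    """(good / total) - alpha * (1 / total): the penalty term
    vanishes as the number of total rates grows."""
    return good / total - alpha * (1 / total)

# Small sample: heavy penalty
print(penalized_score(1, 1))      # 1.0 - 1.0 = 0.0
# Large sample: penalty is negligible
print(penalized_score(95, 100))   # 0.95 - 0.01 = 0.94
```

With alpha = 1, the 1/1 product from the question drops to 0.0 while the 95/100 product barely moves, which matches the intuition that the larger sample deserves more trust.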
Of course, you can always consider more complex rating approaches that capture other properties of your data, such as a larger penalty for a higher rating with the same number of observations, and so on.
Upvotes: 1