Reputation: 15475
I'm curious how to normalize numbers for a ranking algorithm.
Let's say I want to rank a link by importance and I have two columns to work with,
so a table would look like
url | comments | views
Now I want to rank comments higher than views, so my first thought is to weight them with comments*3 or something similar. However, if there is a large view number like 40,000 and only 4 comments, the comment weight gets drowned out.
So I think I have to normalize those scores onto a more level playing field before I can weight them. Any ideas or pointers on how that's usually done?
thanks
Upvotes: 3
Views: 8037
Reputation: 1993
A similar problem was discussed a few weeks ago in this SO topic: "Algorithm to calculate a page importance based on its views / comments".
I'll give the same advice I offered there: use linear regression on a representative distribution of comment/view counts for web pages to work out a weighting function.
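A rough sketch of what that could look like, assuming you have a small sample of pages to which you have hand-assigned an importance score (those labels, the sample data, and the name `fit_weights` are all hypothetical; the approach needs some target to regress against). It fits score = w_comments * comments + w_views * views by ordinary least squares with no intercept, solving the 2x2 normal equations directly:

```python
def fit_weights(samples):
    """Least-squares fit of score ~ w_c * comments + w_v * views (no intercept).

    samples: list of (comments, views, score) tuples, where score is a
    hand-assigned importance label (an assumption, not given in the question).
    """
    s_cc = sum(c * c for c, v, y in samples)
    s_vv = sum(v * v for c, v, y in samples)
    s_cv = sum(c * v for c, v, y in samples)
    s_cy = sum(c * y for c, v, y in samples)
    s_vy = sum(v * y for c, v, y in samples)

    det = s_cc * s_vv - s_cv * s_cv
    if det == 0:
        raise ValueError("comments and views are collinear; cannot fit weights")

    # Solve the normal equations with Cramer's rule.
    w_c = (s_cy * s_vv - s_cv * s_vy) / det
    w_v = (s_cc * s_vy - s_cv * s_cy) / det
    return w_c, w_v


# Hypothetical labelled sample: (comments, views, importance score).
sample = [
    (4, 40000, 52.0),
    (20, 1000, 61.0),
    (0, 500, 0.5),
    (10, 2000, 32.0),
]
w_comments, w_views = fit_weights(sample)
print(w_comments, w_views)
```

The fitted weights absorb the difference in scale between the two columns, so no separate normalization step is needed with this approach.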
Upvotes: 0
Reputation: 11
Importance is really a way of telling the user how interested they might be in a forum topic or a blog post. In that case, you can't just multiply two numbers by different factors and add them :)
What can you say about a blog post with 2000 views and only one comment? Well, perhaps it's a spam post, or it was viewed by web crawlers, or it's so boring that no one decided to comment on it.
In this case, we might want to look at the ratio of comments to views. My example post would have an "interest ratio" of 1/2000, while this post, which has 28 views and 1 comment right now, would get a score of 1/28.
The biggest ratio wins. By the way, if you are getting ratios over one... well, start looking for bugs :)
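In code, the ratio idea might look something like this minimal sketch. The function name `interest_ratio`, the URLs, and the choice to score a zero-view post as 0.0 are my own assumptions:

```python
def interest_ratio(comments, views):
    """Comments per view; 0.0 when there are no views, to avoid dividing by zero."""
    if views <= 0:
        return 0.0
    return comments / views


# The two posts from the example above (URLs are made up).
posts = [
    {"url": "example-post", "comments": 1, "views": 2000},
    {"url": "this-post", "comments": 1, "views": 28},
]

# Biggest ratio first.
ranked = sorted(
    posts,
    key=lambda p: interest_ratio(p["comments"], p["views"]),
    reverse=True,
)
print([p["url"] for p in ranked])
```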
Upvotes: 1
Reputation: 1554
For each url, you could first normalize the comments and views to a common 0-to-1 scale (this is min-max scaling, loosely called a percentile here). For example,
comment_percentile = (comments - min(comments)) / (max(comments) - min(comments))
views_percentile = (views - min(views)) / (max(views) - min(views))
Then you could assign weights to each of the percentile values to compute the overall score.
url_score = (comment_percentile_weight * comment_percentile) + (views_percentile_weight * views_percentile)
Additional strategies may involve eliminating outliers if the values cluster toward one end of the range.
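Here is a small sketch of the formulas above in Python. The sample rows, the default weights of 3 and 1, and the helper names `min_max` and `score_urls` are assumptions for illustration; a column where every value is the same is mapped to 0.0 to avoid dividing by zero.

```python
def min_max(values):
    """Scale values to the 0..1 range: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to normalize, so use 0.0 for each.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def score_urls(rows, comment_weight=3.0, view_weight=1.0):
    """Return {url: weighted sum of normalized comments and views}."""
    comments = min_max([r["comments"] for r in rows])
    views = min_max([r["views"] for r in rows])
    return {
        r["url"]: comment_weight * c + view_weight * v
        for r, c, v in zip(rows, comments, views)
    }


# Made-up sample data, including the 40,000 views / 4 comments case.
rows = [
    {"url": "a", "comments": 4, "views": 40000},
    {"url": "b", "comments": 20, "views": 1000},
    {"url": "c", "comments": 0, "views": 500},
]
scores = score_urls(rows)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

With both columns on the same scale, the heavily viewed but lightly commented url "a" no longer outranks "b", because the comment weight is not swamped by the raw view count.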
Upvotes: 5