HetMes

Reputation: 410

Algorithm to do efficient weighted ranking?

I need an algorithm to do fast weighted ranking of Twitter posts.

Each post has a number of ranking scores (like age, author follower count, keyword mentions, etc.). I'm looking for an algorithm that can quickly find the top N Tweets, given the weights of each ranking score.

Now, the use case is that these weights will change, and recalculating the ranking scores for every tweet every time the weights change is prohibitively expensive.

I will have access to sorted lists of Tweets, one for each ranking score. So I'm looking for an algorithm to efficiently search through these lists to find my top N.

Upvotes: 2

Views: 1330

Answers (1)

Brendan

Reputation: 37232

NOTICE: This answer is provided due to the belief that knowledge is always good (even if it might be used for evil purposes). If you are able to obtain and store/track information like age, author follower count, keyword mentions, etc without ensuring participants fully understand how their data will be used and without obtaining every participant's explicit consent (and without "opt-in, with the ability to opt out at any time"); then you are violating people's privacy and deserve bankruptcy for your grossly unethical malware. It's bad enough that multiple large companies are evil without making it worse.

Assume there's a formula like score = a_rank * a_weight + b_rank * b_weight + c_rank * c_weight.

This can be split into pieces, like:

 a_score = a_rank * a_weight
 b_score = b_rank * b_weight
 c_score = c_rank * c_weight
 score = a_score + b_score + c_score

If you know the range of a_rank you can sort the entries into "a_rank buckets". For example, if you have 100 buckets and "a_rank" can be a value from "a_rank_min" to "a_rank_max", then "bucket_number = (a_rank - a_rank_min) * 100 / (a_rank_max - a_rank_min)" (clamped so that "a_rank_max" itself falls into the last bucket rather than a non-existent bucket 100).
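As a sketch of the bucket assignment (Python, with illustrative names; integer ranks and a fixed bucket count are assumed, and the top edge is clamped so "a_rank_max" lands in the last bucket):

```python
NUM_BUCKETS = 100

def bucket_number(a_rank, a_rank_min, a_rank_max):
    # Map a_rank into [0, NUM_BUCKETS - 1]; clamp so that
    # a_rank == a_rank_max doesn't land in a non-existent bucket.
    b = (a_rank - a_rank_min) * NUM_BUCKETS // (a_rank_max - a_rank_min)
    return min(b, NUM_BUCKETS - 1)

def build_buckets(entries, a_rank_min, a_rank_max):
    # entries: iterable of dicts with an "a_rank" key (illustrative shape)
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for e in entries:
        buckets[bucket_number(e["a_rank"], a_rank_min, a_rank_max)].append(e)
    return buckets
```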

From here you can say that all entries in a specific "a_rank bucket" must have an "a_score" in a specific range; and you can calculate the minimum and maximum possible "a_score" for all entries in a bucket from "bucket_number" alone; using formulas like "min_a_score_for_bucket = (bucketNumber * (a_rank_max - a_rank_min) / 100 + a_rank_min) * a_weight" and "max_a_score_for_bucket = ( (bucketNumber+1) * (a_rank_max - a_rank_min) / 100 + a_rank_min) * a_weight - 1".
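The per-bucket score bounds can be sketched the same way (illustrative names; integer ranks and weights assumed, mirroring the formulas above):

```python
NUM_BUCKETS = 100

def a_score_bounds(bucket_no, a_rank_min, a_rank_max, a_weight):
    # Min and max possible a_score for any entry in this bucket,
    # derived from the bucket number alone.
    span = a_rank_max - a_rank_min
    lo = (bucket_no * span // NUM_BUCKETS + a_rank_min) * a_weight
    hi = ((bucket_no + 1) * span // NUM_BUCKETS + a_rank_min) * a_weight - 1
    return lo, hi
```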

The next step is to establish a "current 10 entries with the highest score so far". Do this by selecting the first 10 entries from the highest "a_rank bucket(s)" and calculating their scores fully.

Once this is done (and you know the "10th highest score so far") you can calculate a filter for each bucket. If you assume every entry in a bucket has the maximum possible a_rank (determined from the bucket number alone) and the maximum possible c_rank (determined from the range of all c_rank values), you can calculate the minimum b_rank an entry would need for its score to beat the "10th highest score so far". In the same way, assuming the maximum possible a_rank and the maximum possible b_rank, you can calculate the minimum needed c_rank. The "minimum needed b_rank" and "minimum needed c_rank" can then be used to skip entries that couldn't possibly beat the "10th highest score so far", without calculating the score for any of those entries.
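The "minimum needed" filters might be sketched like this (illustrative names; positive weights assumed):

```python
def min_needed_b_rank(threshold, a_max_in_bucket, c_rank_max,
                      a_weight, b_weight, c_weight):
    # Smallest b_rank that could still beat the "10th highest score so
    # far", assuming the most optimistic a_rank (bucket max) and c_rank.
    return (threshold - a_max_in_bucket * a_weight
            - c_rank_max * c_weight) / b_weight

def min_needed_c_rank(threshold, a_max_in_bucket, b_rank_max,
                      a_weight, b_weight, c_weight):
    # Symmetric filter for c_rank.
    return (threshold - a_max_in_bucket * a_weight
            - b_rank_max * b_weight) / c_weight
```

An entry whose b_rank falls below the first value, or whose c_rank falls below the second, can be skipped without ever computing its full score.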

Of course every time you find an entry with a score higher than the "10th highest score so far" you get a new "10th highest score so far" and have to recalculate the "minimum needed b_rank" and "minimum needed c_rank" for the buckets. Ideally you'd look at buckets in "highest a_rank bucket first" order, and therefore only recalculate the "minimum needed b_rank" and "minimum needed c_rank" for the current bucket.

Near the start (while you're looking at the bucket with the highest a_rank values) it probably won't filter out many entries and might even make performance worse (due to the cost of recalculating "minimum needed b_rank" and "minimum needed c_rank" values). Near the end (while you're looking at the buckets with the lowest a_rank values) you may be able to skip entire buckets without looking at any entry in them.
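The whole scan might fit together like this (a sketch, not the answer's exact code: entries are dicts with integer "a", "b", "c" ranks, weights are assumed positive, and the filter thresholds are recomputed per entry for simplicity rather than only when the threshold changes):

```python
import heapq

def top_n(buckets, wa, wb, wc, a_rank_min, a_rank_max,
          b_rank_max, c_rank_max, n=10):
    num_buckets = len(buckets)
    heap = []  # min-heap of (score, tie-breaker, entry); heap[0] = nth best

    def consider(entry):
        score = entry["a"] * wa + entry["b"] * wb + entry["c"] * wc
        item = (score, id(entry), entry)
        if len(heap) < n:
            heapq.heappush(heap, item)
        elif score > heap[0][0]:
            heapq.heapreplace(heap, item)

    span = a_rank_max - a_rank_min
    for bucket_no in range(num_buckets - 1, -1, -1):  # highest a_rank first
        # Most optimistic a_rank for this bucket, from its number alone.
        a_max = (bucket_no + 1) * span // num_buckets + a_rank_min - 1
        for entry in buckets[bucket_no]:
            if len(heap) == n:
                threshold = heap[0][0]
                # Necessary conditions to beat the nth best so far.
                min_b = (threshold - a_max * wa - c_rank_max * wc) / wb
                min_c = (threshold - a_max * wa - b_rank_max * wb) / wc
                if entry["b"] < min_b or entry["c"] < min_c:
                    continue  # cannot possibly beat the threshold
            consider(entry)

    return sorted(heap, reverse=True)  # highest score first
```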

Note that:

  • all the weights can change without changing any of the buckets; but it's nicer for performance if "a_rank" has the strongest influence on the score.

  • the range of values for "a_rank" shouldn't change (you'd have to rebuild the buckets if it does); but the range of values for "b_rank" and "c_rank" can be variable (updated every time a new entry is created).

  • sorting each bucket in "highest a_rank first" order (and then using "highest b_rank first" as a tie-breaker, etc) will help performance when finding the 10 entries with the highest score; but it will also add overhead when an entry is added. For this reason, for most cases, I probably wouldn't bother sorting the contents of buckets at all.

  • it would be nice if you can have a bucket for each possible value of "a_rank"; as this gives almost all of the benefits of sorting without any of the overhead of sorting. If you can't have a bucket for each possible value of "a_rank", then increasing the number of buckets can help performance.

  • in theory, it would be possible to have multiple layers of "bucketing" (e.g. "a_rank buckets" that contain "b_rank buckets"). This would significantly increase complexity and memory consumption; but (especially if no sorting is done) it might significantly improve performance, or, in some cases, make it worse.

Upvotes: 1
