Vici0usX
Vici0usX

Reputation: 11

Order By Two Columns - Using Highest Rating Average with Most Ratings

I would like to show ratings with the highest average (rating_avg) AND number of ratings(rating_count). With my current script, it shows the highest average rating (DESC) regardless of how many ratings there are, which is useless for my visitors.

For example it shows:

  1. Item 1 - 5.0 (1 Ratings)
  2. Item 2 - 5.0 (2 Ratings)

When it should be showing the Top 10 Highest rated items by rating avg and amount of ratings, such as:

  1. Item 1 - 4.5 (356 Ratings)
  2. Item 2 - 4.3 (200 Ratings)
  3. Item 3 - 4.0 (400 Ratings)

This is what I have right now:

$result = mysql_query("SELECT id, filename, filenamedisplay, console_dir, downloads, rating_avg, rating_count FROM files WHERE console_dir = '".$nodash."' ORDER BY rating_avg DESC LIMIT 10");

Thanks and I appreciate any help in advance!

Upvotes: 1

Views: 168

Answers (1)

Gordon Linoff
Gordon Linoff

Reputation: 1271231

This is a subtle problem and an issue in statistics. What I do is often to downgrade the ratings by one standard error for the proportion. These aren't exactly proportions, but I think the same idea can be applied.

You can calculate this using the "square root of p*q divided by n" method. If you don't understand this, google "standard error of a proportion" (or I might suggest the third chapter in "Data Analysis Using SQL and Excel" which explains this in more detail):

SELECT id, filename, filenamedisplay, console_dir, downloads, rating_avg, rating_count
FROM files cross join
     (select count(*) as cnt from files where console_dir = '".$nodash."') as const
WHERE console_dir = '".$nodash."'
ORDER BY rating_avg/5 - sqrt((rating_avg/5) * (1 - rating_avg/5) / const.cnt) DESC
LIMIT 10;

In any case, see if the formula works for you.

EDIT:

Okay, let's change this to the standard error of the mean. I should have done this the first time through, but I was thinking the rating_avg was a proportion. The formula is the standard deviation divided by the square root of the sample size. We can get the population standard deviation in the const subquery:

     (select count(*) as cnt, stdev(rating_avg) as std from files where console_dir = '".$nodash."') as const

This results in:

order by rating_avg - std / sqrt(const.cnt)

This might work, but I would rather have the standard deviation within each group rather than the overall population standard deviation. But, it derates the rating by an amount proportional to the size of the sample, which should improve your results.

By the way, the idea of removing one standard deviation is rather arbitrary. I've just found that it produces reasonable results. You might prefer to take, say, 1.96 times the standard deviation to get a 95% lower bound on the confidence interval.

Upvotes: 3

Related Questions