JustArchi
JustArchi

Reputation: 129

Complex SQL ranking query

I have 3 fairly simple tables:

users

user_id
1
2
3

radio_songs

song_id  song
1        SomeName
2        OtherName

radio_rates

user_id  song_id  rate (from 1 to 5)
1        1        5
2        1        4
1        2        2
2        2        2

I've wrote quite complex query targetting MySQL that calculates current "position" (rank) of the song according to lower bound of Wilson score confidence interval for a Bernoulli parameter.

SELECT rank FROM(
    SELECT x.song AS song, x.ci_lower_bound AS ci_lower_bound, (@row:= @row + 1) AS rank FROM(
        SELECT song, ((SUM((rate - 1) * 0.25) + 1.9208) / (SUM((rate - 1) * 0.25) + SUM((5 - rate) * 0.25)) - 1.96 * SQRT((SUM((rate - 1) * 0.25) * SUM((5 - rate) * 0.25)) / (SUM((rate - 1) * 0.25) + SUM((5 - rate) * 0.25)) + 0.9604) / (SUM((rate - 1) * 0.25) + SUM((5 - rate) * 0.25))) / (1 + 3.8416 / (SUM((rate - 1) * 0.25) + SUM((5 - rate) * 0.25))) AS ci_lower_bound
        FROM radio_rates
        INNER JOIN radio_songs ON radio_rates.song_id = radio_songs.song_id 
        GROUP BY radio_rates.song_id
        ORDER BY ci_lower_bound DESC
    ) x, (SELECT @row := 0) r
) xx WHERE xx.song = @song

This query basically accepts @song parameter and:

The query works properly and I'm quite satisfied with it, but when we have multiple songs with the same score then due to sorting the resulting rank can vary between executions of the same SQL query. I wanted to avoid that by getting MIN() rank from all songs that have the same score as the target one, but the query got so complicated that I'm struggling how to do that without a temporary table - is it even possible?

I'd appreciate help, as well as any recommendations in terms of performance/optimization of above query.

I know that it'd be worth it to consider simply adding another score column to songs table and calculating it on each insert/update via trigger, but I'd want to avoid that if possible and calculate rank on-demand. Therefore SQL query itself is most important for me.

Thank you in advance.

Upvotes: 0

Views: 411

Answers (1)

Paul Spiegel
Paul Spiegel

Reputation: 31812

This might work for you:

SELECT rank FROM(
    SELECT x.song AS song,
           (@row:= @row + 1) AS rn,
           IF(@last_score = x.ci_lower_bound, @rank, @rank := @row) AS rank
           (@last_score := x.ci_lower_bound) AS ci_lower_bound
    FROM(
        SELECT song, ((SUM((rate - 1) * 0.25) + 1.9208) / (SUM((rate - 1) * 0.25) + SUM((5 - rate) * 0.25)) - 1.96 * SQRT((SUM((rate - 1) * 0.25) * SUM((5 - rate) * 0.25)) / (SUM((rate - 1) * 0.25) + SUM((5 - rate) * 0.25)) + 0.9604) / (SUM((rate - 1) * 0.25) + SUM((5 - rate) * 0.25))) / (1 + 3.8416 / (SUM((rate - 1) * 0.25) + SUM((5 - rate) * 0.25))) AS ci_lower_bound
        FROM radio_rates
        INNER JOIN radio_songs ON radio_rates.song_id = radio_songs.song_id 
        GROUP BY radio_rates.song_id
        ORDER BY ci_lower_bound DESC
    ) x, (SELECT @row := 0, @rank := null, @last_score := null) r
) xx WHERE xx.song = @song

The changes are:

SELECT x.song AS song,
       (@row:= @row + 1) AS rn,
       IF(@last_score = x.ci_lower_bound, @rank, @rank := @row) AS rank
       (@last_score := x.ci_lower_bound) AS ci_lower_bound

and

(SELECT @row := 0, @rank := null, @last_score := null) r

In this line

IF(@last_score = x.ci_lower_bound, @rank, @rank := @row) AS rank

You set the rank to the row number only if the score has changed compared to the last row. If the score is the same, the rank from the last row is used.

Warning: Using session variables this way, you are always at risk that your code will return unexpected results when you upgrade to a new version. If it works, then it is because of how the engine is implemented. There is no guarantee, that the expressions will be executed in the expected order.

As a general rule, other than in SET statements, you should never assign a value to a user variable and read the value within the same statement. For example, to increment a variable, this is okay:

SET @a = @a + 1;

For other statements, such as SELECT, you might get the results you expect, but this is not guaranteed. In the following statement, you might think that MySQL will evaluate @a first and then do an assignment second:

SELECT @a, @a:=@a+1, ...;

However, the order of evaluation for expressions involving user variables is undefined.

User-Defined Variables

Upvotes: 1

Related Questions