Join Left or WHERE solution - Most efficient?

Question

I am learning about databases at college, and have the assignment about finding the minimum avg exam grade for a college course. I have made two solutions, but I hope you experts in here can help me with:

What is the best/most effective solution?

Solution 1:

SELECT courses.name , MIN(avg_grade)
FROM (SELECT courseCode, AVG(grade) as avg_grade
      FROM exams
      GROUP BY courseCode) avg_grades, courses
WHERE courses.code = avg_grades.courseCode

Solution 2:

SELECT name, min(avg_grade)
FROM (SELECT courses.name, AVG(grade) as avg_grade
      FROM courses
      LEFT JOIN exams on exams.courseCode = courses.code
      GROUP BY courseCode) mytable

And I have been thinking about if JOIN or LEFT JOIN is the correct to use here?

GarethD · Accepted Answer

Your two queries are different, so you can't really compare efficiency, your second query will return records for courses with no exam results. Assuming that you switch the LEFT JOIN to an INNER to make the queries comparable, then I would expect the first query to be slightly more efficient since it only has one derived table, and the second has two:

Solution 1:

ID  SELECT_TYPE     TABLE   TYPE    POSSIBLE_KEYS   KEY KEY_LEN REF ROWS    FILTERED    EXTRA
1   PRIMARY         ALL                                             5       100 
1   PRIMARY courses ALL                                             5       100     Using where; Using join buffer
2   DERIVED exams   ALL                                             5       100     Using temporary; Using filesort

Solution 2:

ID  SELECT_TYPE     TABLE   TYPE    POSSIBLE_KEYS   KEY KEY_LEN REF ROWS    FILTERED    EXTRA
1   PRIMARY         ALL                                             5       100 
2   DERIVED courses ALL                                             5       100         Using temporary; Using filesort
2   DERIVED exams   ALL                                             5       100         Using where; Using join buffer

I would however check this against your own execution plans as mine was just a quick example on SQL Fiddle.

I would like to take this chance to advise against using the ANSI-89 implicit join syntax, it was replaced over 20 years ago by the explicit join syntax in the ANSI-92 standard. Aaron Bertrand has written a great article on why to switch, I won't duplicate it here.

Another, much more important point though is that your queries are not deterministic, that is to say you could run the same query twice and get 2 different results even with no underlying change in the data.

Taking your second query as an example (although you will notice both queries are wrong on the SQL-Fiddle), you have a subquery MyTable like so:

SELECT courses.name, AVG(grade) as avg_grade
FROM courses
LEFT JOIN exams on exams.courseCode = courses.code
GROUP BY courseCode

This returned a table like so:

Name    |   avg_grade
--------+--------------
   A    |       10
   B    |       5
   C    |       6
   D    |       7
   E    |       2

You may expect the query as a whole to return:

Name    |   avg_grade
--------+--------------
   E    |       2

Since 2 is the lowest average grade, and E is the name that corresponds with that. You would be wrong though, as demonstrated here you can see this actually returns:

Name    |   avg_grade
--------+--------------
   A    |       2

What is essentially happening is that MySQL is calculating the minimum avg_grade correctly, but since you have not added any columns to the group by you have given MySQL Carte blanche to choose any value for Name it chooses.

To get the output you want, I think you need:

SELECT  courses.name , MIN(avg_grade)
FROM    (   SELECT  courseCode, AVG(grade) as avg_grade
            FROM    exams
            GROUP BY courseCode
        ) avg_grades
        INNER JOIN courses
            ON courses.code = avg_grades.courseCode
GROUP BY courses.Name;

Or if you only want to the course with the lowest average grade then use:

SELECT  courseCode, AVG(grade) as avg_grade
FROM    exams
GROUP BY courseCode
ORDER BY avg_grade
LIMIT 1;

Examples on SQL Fiddle

Please excuse the laziness of what I am about to do, but I have explained this problem a lot before, and now have a standard response that I post to explain the issue of MySQL grouping. It goes into more detail than the above, and hopefully explains it further.

MySQL Implicit Grouping

I would advise to avoid the implicit grouping offered by MySQL where possible, by this i mean including columns in the select list, even though they are not contained in an aggregate function or the group by clause.

Imagine the following simple table (T):

ID  | Column1 | Column2  |
----|---------+----------|
1   |    A    |    X     |
2   |    A    |    Y     |

In MySQL you can write

SELECT  ID, Column1, Column2
FROM    T
GROUP BY Column1;

This actually breaks the SQL Standard, but it works in MySQL, however the trouble is it is non-deterministic, the result:

ID  | Column1 | Column2  |
----|---------+----------|
1   |    A    |    X     |

Is no more or less correct than

ID  | Column1 | Column2  |  
----|---------+----------|
2   |    A    |    Y     |

So what you are saying is give me one row for each distinct value of Column1, which both results sets satisfy, so how do you know which one you will get? Well you don't, it seems to be a fairly popular misconception that you can add and ORDER BY clause to influence the results, so for example the following query:

SELECT  ID, Column1, Column2
FROM    T
GROUP BY Column1
ORDER BY ID DESC;

Would ensure that you get the following result:

ID  | Column1 | Column2  |  
----|---------+----------|
2   |    A    |    Y     |

because of the ORDER BY ID DESC, however this is not true (as demonstrated here).

The MySQL documents state:

The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate. Furthermore, the selection of values from each group cannot be influenced by adding an ORDER BY clause.

So even though you have an order by this does not apply until after one row per group has been selected, and this one row is non-deterministic.

The SQL-Standard does allow columns in the select list not contained in the GROUP BY or an aggregate function, however these columns must be functionally dependent on a column in the GROUP BY. For example, ID in the sample table is the PRIMARY KEY, so we know it is unique in the table, so the following query conforms to the SQL standard and would run in MySQL and fail in many DBMS currently (At the time of writing Postgresql is the closest DBMS I know of to correctly implementing the standard):

SELECT  ID, Column1, Column2
FROM    T
GROUP BY ID;

Since ID is unique for each row, there can only be one value of Column1 for each ID, one value of Column2 there is no ambiguity about what to return for each row.

EDIT

From the SQL-2003-Standard (5WD-02-Foundation-2003-09 - page 346) - http://www.wiscorp.com/sql_2003_standard.zip

If T is a grouped table, then let G be the set of grouping columns of T. In each contained in , each column reference that references a column of T shall reference some column C that is functionally dependent on G or shall be contained in an aggregated argument of a whose aggregation query is QS.

Join Left or WHERE solution - Most efficient?

Answers (1)

Related Questions