What will be faster for GROUP BY statement

Question

Imagine that I have the next two SQL Server tables:

CREATE TABLE Users (
    id INT IDENTITY(1, 1) PRIMARY KEY,
    name VARCHAR(100) NOT NULL
)

CREATE TABLE UserLogins (
   id INT IDENTITY(1, 1) PRIMARY KEY,
   user_id INT REFERENCES Users(id) NOT NULL,
   login VARCHAR(100) NOT NULL
)

And I need to get a count of user logins for each user. And the query result should contain user name, for example. Which query will work faster:

SELECT MAX(name), count(*)
FROM Users u
    INNER JOIN UserLogins ul ON ul.user_id = u.id
GROUP BY u.id

or the next one:

SELECT name, count(*)
FROM Users u
    INNER JOIN UserLogins ul ON ul.user_id = u.id
GROUP BY u.name

So, I'm not sure, if it will be better to group by the column with an index and then use MAX or MIN aggregate function. Or just group by Users.name, which doesn't have any indexes.

Thank you in advance!

Charlieface · Accepted Answer

The answer is: neither is really correct.

The second version is completely wrong as name is not unique. The first version is correct, although it may not be efficient.

Since name has a functional dependency on id, every unique value of id also defines a value of name. Grouping by name is wrong, because name is not necessarily unique. Grouping only by id means you need to aggregate name, which makes no sense if there is a functional dependency. So you actually want to group by both columns:

SELECT
  u.name,
  count(*)
FROM Users u
    INNER JOIN UserLogins ul ON ul.user_id = u.id
GROUP BY
  u.id,
  u.name;

Note that id does not actually need to be selected.

This query is almost certainly going to be faster than grouping by name alone, because the server cannot deduce that name is unique and needs to sort and aggregate it.

It may also be faster than grouping by id, although that may depend on whether the optimizer is clever enough to deduce the functional dependency (and therefore no aggregation would be necessary). Even if it isn't clever, this probably won't be slow, as id is already unique, so a scan of an index over id would not require a sort, only aggregation.

What will be faster for GROUP BY statement

Answers (1)

Related Questions