richliaw
richliaw

Reputation: 2045

Why does nested query in SQLite return the wrong value?

I'm dealing with the yelp dataset on sqlite3, and I have a schema as follows:

    CREATE TABLE businesses
     (business_id text primary key,
      name text,
      full_address text,
      city text,
      stars decimal(1,1),
      ratings integer);

I run the following query:

     select name, stars from businesses where ratings >= 400;

And I get the following:

    Pho Kim Long |3.5
    Pho So 1     |3.5
    Ichiza       |4
    ...
    Guy Fieri    |3.5
    Giada        |4

If I run:

    select name, min(stars) from businesses where ratings >= 400;

I get The Quad Las Vegas Resort & Casino|2

But if I run:

    select name 
    from (select name, min(stars) 
          from businesses 
          where ratings >= 400);

I get Giada.

I know that there's a way to do this with LIMIT X, but why does this occur in the first place? Is there something about the SQLite parser that I'm neglecting?

Upvotes: 2

Views: 711

Answers (4)

pilcrow
pilcrow

Reputation: 58544

The other answers here are correct (and thus upvoted), but I'm not sure they quite get you from where you are to where you want to be.

Why isn't the query's result predictable?

Typically, aggregate functions like MIN/MAX/COUNT are used with GROUP BY clauses. For example:

-- Count the number of businesses per city
SELECT city, COUNT(1) FROM biz GROUP BY city;

-- Find the minimum rating of any business per city
SELECT city, MIN(stars) FROM biz GROUP BY city;

What happens when you omit the GROUP BY clause but keep the non-aggregated field ("city") in the query? You're only going to get one MIN()imum, but which "city" will it be? What if two or more cities are tied for lowest rated businesses?

-- Er, which city will it be?
SELECT city, MIN(stars) FROM biz;

Some RDBMS prohibit this construction entirely, and the SQL specs disallow it.* However, SQLite and some others will return a single row in this case, with the minimum rating across the entire table and an arbitrarily chosen "city". You cannot rely on that row to represent all the aggregates. Indeed, depending on the RDBMS version it may not even represent any properly aggregated result.** As you've seen, SQLite is free to pick one row sometimes, and another row other times.

What should you be doing?

If I understand, you want to select the very worst rated restaurant(s) (lowest "stars"), but only of those restaurants with at least 400 reviews ("ratings").

Either of @JuanCarlosOropeza's approaches work: find MIN(stars) across all records WHERE ratings >= 400 and then use that to select the name of each business whose stars equal that. (SQL analytic function, unsupported in SQLite at the time of this writing, offer more ways to compute the same.)

* As @harmic notes, the situation is a little more nuanced than that.

** As @CL. points out, later versions of SQLite do guarantee that the row selected does match the aggregate function. That is, the name you get does actually have the MIN(stars) rating.

Upvotes: 0

CL.
CL.

Reputation: 180060

SQLite 3.7.11 or later guarantees that the values of unaggregated columns come from a row that matches the min() or max().

However, unless you are using one of the latest versions, the query planner will optimize away the min()/max() values if they are not returned from the query. You must list the min(stars) in the outermost query.

Upvotes: 1

harmic
harmic

Reputation: 30587

In the SQLite docs, it says:

If the SELECT statement is an aggregate query without a GROUP BY clause, then each aggregate expression in the result-set is evaluated once across the entire dataset. Each non-aggregate expression in the result-set is evaluated once for an arbitrarily selected row of the dataset. The same arbitrarily selected row is used for each non-aggregate expression.

In other words, the actual name you are getting is purely random in this case, it does not bear any relation to the row that actually has the minimum value of stars.

According to the SQL-92 standard, in a query containing aggregate expressions and non-aggregate expressions all of the non-aggregate expressions must appear in the GROUP BY clause. SQL-2003 has a similar but somewhat more complicated set of rules. This blog article provides a summary.

Some databases allow making queries that include aggregate and non-aggregate expressions without any GROUP BY clause, or without all non-aggregates included in the GROUP BY clause, but as you have discovered the results may be indeterminate. Other databases will display an error and refuse to run the query.

It is hard to give specific advice on how to correct your query because you have not stated what output you are trying to get. If you are trying to find out which row has the minimum value of stars then one of the proposals in Juan's answer should work.

Upvotes: 2

Juan Carlos Oropeza
Juan Carlos Oropeza

Reputation: 48197

The problem is your MIN() function bring the smallest value for stars, but not the name match for that row.

SQL Fiddle Demo

You can do a cross join

SELECT name
FROM businesses b,
     ( SELECT min(stars) as MinValue
       FROM businesses 
       WHERE ratings >= 400) as M
WHERE b.stars = M.MinValue;

OR Inner Select

SELECT name
FROM businesses b         
WHERE b.stars = ( SELECT min(stars) as MinValue
                  FROM businesses 
                  WHERE ratings >= 400);

Upvotes: 1

Related Questions