Reputation: 862

Are these SQL statements equal?

I have the following table:

| sample_id (varchar, unique) | field1 (int) | field2 (int) | ... 
--------------------------------------------------------------------
| 9b7acb476c4ab04c7ddbc       | 100          | 56           | ...
| a2e4df67e98ccaf088abf       | 23           | NULL         | ...
| fcbe9cecd6b96cba7c6ee       | NULL         | 43           | ...
...

I have the following code created by a prior user to query two fields at the same time and getting a random subset of the rows:

SELECT   sample_id, field1, field2
FROM     samples
WHERE    field1 != NULL
UNION ALL
SELECT   sample_id, field1, field2
FROM     samples
WHERE    field2 != NULL
ORDER BY RAND()
LIMIT    1000

I thought of optimizing the code by rewriting the query as:

SELECT   sample_id, field1, field2
FROM     samples
WHERE    field1 != NULL
OR       field2 != NULL
ORDER BY RAND()
LIMIT    1000

Based on some documentation I read here it seems that both the queries are equivalent but I'm not sure how the ORDER BY RAND() line would be handled in the query. Is it only applied to the second query (i.e. the query after the UNION ALL)?

Upvotes: 0

Answers (3)

Gordon Linoff

Reputation: 1269633

[THIS WAS THE ORIGINAL VERSION OF THE QUESTION]

Not at all. != NULL will filter out all data, because almost all comparisons to NULL return NULL, which is treated as false.

!= '' will return all values that do not contain an empty string and are not NULL.

The correct comparisons to NULL use is null and is not null.

[AFTER THE EDIT] The query you want is:

SELECT sample_id, field1, field2
FROM samples
WHERE field1 IS NOT NULL OR field2 IS NOT NULL
ORDER BY RAND()
LIMIT 1000;

Upvotes: 1

Vorpulus Lyphane

Reputation: 702

The order by is applied to the result of the union.

Upvotes: 0

LSerni

Reputation: 57388

First of all, your queries are currently not working. They both select using conditions that are false, as all NULL comparisons are:

mysql> select '' != NULL, NULL != NULL, 0 != NULL, 'hello' != NULL, 42 != NULL, (1=0)!=NULL, (1=1)!=NULL;
+------------+--------------+-----------+-----------------+------------+-------------+-------------+
| '' != NULL | NULL != NULL | 0 != NULL | 'hello' != NULL | 42 != NULL | (1=0)!=NULL | (1=1)!=NULL |
+------------+--------------+-----------+-----------------+------------+-------------+-------------+
|       NULL |         NULL |      NULL |            NULL |       NULL |        NULL |        NULL |
+------------+--------------+-----------+-----------------+------------+-------------+-------------+
1 row in set (0.00 sec)

select 1 from test where null;
Empty set (0.00 sec)

Now if you use a different condition, e.g. WHERE field1 IS NOT NULL, the queries may be still be subtly not equivalent.

The first updated UNIONed subquery will now return rows where field1 is not null. This will be supplemented by rows where field2 is not null.

The UNION ALL suppresses duplicates.

A row with field1 and field2 both null, if it exists, will be selected twice by UNION ALL, and have double probability of being chosen.

So in both cases you get at most 1000 records, but the two sets will be subtly different.

It may well be that your optimized query, once updated with IS NOT NULL instead of !=, is the actual query you needed from the beginning.

But if you do want double probability for doubly-nulled rows, then the optimized query will not be equivalent, and it may skew results if you use those data as input in some stochastic process.

Upvotes: 0

Are these SQL statements equal?

Answers (3)

Related Questions