Reputation: 862
I have the following table:
| sample_id (varchar, unique) | field1 (int) | field2 (int) | ...
--------------------------------------------------------------------
| 9b7acb476c4ab04c7ddbc | 100 | 56 | ...
| a2e4df67e98ccaf088abf | 23 | NULL | ...
| fcbe9cecd6b96cba7c6ee | NULL | 43 | ...
...
I have the following code created by a prior user to query two fields at the same time and getting a random subset of the rows:
SELECT sample_id, field1, field2
FROM samples
WHERE field1 != NULL
UNION ALL
SELECT sample_id, field1, field2
FROM samples
WHERE field2 != NULL
ORDER BY RAND()
LIMIT 1000
I thought of optimizing the code by rewriting the query as:
SELECT sample_id, field1, field2
FROM samples
WHERE field1 != NULL
OR field2 != NULL
ORDER BY RAND()
LIMIT 1000
Based on some documentation I read here it seems that both the queries are equivalent but I'm not sure how the ORDER BY RAND()
line would be handled in the query. Is it only applied to the second query (i.e. the query after the UNION ALL
)?
Upvotes: 0
Views: 49
Reputation: 1269633
[THIS WAS THE ORIGINAL VERSION OF THE QUESTION]
Not at all. != NULL
will filter out all data, because almost all comparisons to NULL
return NULL
, which is treated as false.
!= ''
will return all values that do not contain an empty string and are not NULL
.
The correct comparisons to NULL
use is null
and is not null
.
[AFTER THE EDIT] The query you want is:
SELECT sample_id, field1, field2
FROM samples
WHERE field1 IS NOT NULL OR field2 IS NOT NULL
ORDER BY RAND()
LIMIT 1000;
Upvotes: 1
Reputation: 57388
First of all, your queries are currently not working. They both select using conditions that are false, as all NULL comparisons are:
mysql> select '' != NULL, NULL != NULL, 0 != NULL, 'hello' != NULL, 42 != NULL, (1=0)!=NULL, (1=1)!=NULL;
+------------+--------------+-----------+-----------------+------------+-------------+-------------+
| '' != NULL | NULL != NULL | 0 != NULL | 'hello' != NULL | 42 != NULL | (1=0)!=NULL | (1=1)!=NULL |
+------------+--------------+-----------+-----------------+------------+-------------+-------------+
| NULL | NULL | NULL | NULL | NULL | NULL | NULL |
+------------+--------------+-----------+-----------------+------------+-------------+-------------+
1 row in set (0.00 sec)
select 1 from test where null;
Empty set (0.00 sec)
Now if you use a different condition, e.g. WHERE field1 IS NOT NULL
, the queries may be still be subtly not equivalent.
The first updated UNIONed subquery will now return rows where field1 is not null. This will be supplemented by rows where field2 is not null.
The UNION ALL suppresses duplicates.
A row with field1 and field2 both null, if it exists, will be selected twice by UNION ALL, and have double probability of being chosen.
So in both cases you get at most 1000 records, but the two sets will be subtly different.
It may well be that your optimized query, once updated with IS NOT NULL instead of !=, is the actual query you needed from the beginning.
But if you do want double probability for doubly-nulled rows, then the optimized query will not be equivalent, and it may skew results if you use those data as input in some stochastic process.
Upvotes: 0