Satish

Reputation: 17407

MySQL RAND() optimization with LIMIT option

I have 50,000 rows in a table and I am running the following query. I have heard it is a bad idea, but how do I make it work in a better way?

mysql> SELECT t_dnis,account_id FROM mytable WHERE o_dnis = '15623157085' AND enabled = 1 ORDER BY RAND() LIMIT 1;
+------------+------------+
| t_dnis     | account_id |
+------------+------------+
| 5623157085 | 1127       |
+------------+------------+

Is there any other way I can make this query faster, or are there other options I should use?

I am not a DBA, so sorry if this question has been asked before :(

Note: currently we are not seeing a performance issue, but we are growing, so there could be an impact in the future. I just want to know the plus and minus points before we are out of the woods.

Upvotes: 0

Views: 228

Answers (3)

Arnold Daniels

Reputation: 16573

The problem with ORDER BY RAND() LIMIT 1 is that MySQL gives each row a random value, sorts on that value (performing a full table scan), and then drops all the results but one.

This is especially bad on a table with a lot of rows, when doing a query like

SELECT * FROM foo ORDER BY RAND() LIMIT 1

However, in your case the query is already filtering on o_dnis and enabled. If only a limited number of rows match (like a few hundred), doing an ORDER BY RAND() shouldn't cause a performance issue.

The alternative requires two queries: one to count and the other to fetch.

In pseudo code:

count = query("SELECT COUNT(*) FROM mytable WHERE o_dnis = '15623157085' AND enabled = 1").value
offset = random(0, count - 1)
result = query("SELECT t_dnis, account_id FROM mytable WHERE o_dnis = '15623157085' AND enabled = 1 LIMIT 1 OFFSET " + offset).row

Note: For the pseudo code to perform well, there needs to be a (multi-column) index on o_dnis, enabled.
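The pseudo code above can be sketched as runnable Python. This uses SQLite in place of MySQL and made-up sample rows, so treat it as an illustration of the count-then-offset technique, not production code:

```python
import random
import sqlite3

# In-memory SQLite database standing in for MySQL; columns mirror the question.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (t_dnis TEXT, account_id INTEGER, o_dnis TEXT, enabled INTEGER)"
)
# Ten fabricated matching rows.
rows = [("562315708%d" % i, 1000 + i, "15623157085", 1) for i in range(10)]
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?)", rows)

where = "o_dnis = ? AND enabled = 1"
params = ("15623157085",)

# Query 1: count the rows that match the filter.
count = conn.execute(
    "SELECT COUNT(*) FROM mytable WHERE " + where, params
).fetchone()[0]

# Pick a random position, then query 2: fetch exactly one row at that offset.
offset = random.randint(0, count - 1)
result = conn.execute(
    "SELECT t_dnis, account_id FROM mytable WHERE " + where + " LIMIT 1 OFFSET ?",
    params + (offset,),
).fetchone()
print(result)
```

Note that LIMIT 1 OFFSET n still scans past the first n matching rows, so this wins over ORDER BY RAND() mainly because it avoids assigning and sorting random values for every match.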

Upvotes: 0

Gordon Linoff

Reputation: 1269743

This query:

SELECT t_dnis, account_id
FROM mytable
WHERE o_dnis = '15623157085' AND enabled = 1
ORDER BY RAND()
LIMIT 1;

is not sorting 50,000 rows. It is sorting the number of rows that match the WHERE clause. As you state in the comments, this is in the low double digits. On a handful of rows, the use of ORDER BY rand() should not have much impact on performance.

You do want an index. The best index would be mytable(o_dnis, enabled, t_dnis, account_id). This is a covering index for the query, so the original data pages do not need to be accessed.
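One way to check that such an index actually covers the query is to inspect the query plan. Here is a small sketch using Python's sqlite3 as a stand-in (MySQL's EXPLAIN output looks different, but the idea is the same):

```python
import sqlite3

# SQLite stand-in; table and index columns mirror the answer's suggestion.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (t_dnis TEXT, account_id INTEGER, o_dnis TEXT, enabled INTEGER)"
)
conn.execute(
    "CREATE INDEX idx_cover ON mytable (o_dnis, enabled, t_dnis, account_id)"
)

# EXPLAIN QUERY PLAN reports whether the index alone satisfies the query.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT t_dnis, account_id FROM mytable "
    "WHERE o_dnis = '15623157085' AND enabled = 1"
).fetchall()
for row in plan:
    print(row[3])  # the detail column mentions the covering index
```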

Under most circumstances, I would expect the ORDER BY to be fine up to at least a few hundred rows, if not several thousand. Of course, this depends on lots of factors, such as your response-time requirements, the hardware you are running on, and how many concurrent queries are running. My guess is that your current data/configuration does not pose a performance problem, and there is ample room for growth in the data without an issue arising.

Upvotes: 1

LSerni

Reputation: 57388

Unless you are running on very slow hardware, you should not experience problems sorting (much less than) 50,000 rows. The fact that you are still asking the question makes me suspect that your problem does not actually lie in the RAND().

For example, one possible cause of slowness could be the lack of a proper index - in that case you can go for a covering index:

CREATE INDEX mytable_ndx ON mytable (enabled, o_dnis, t_dnis, account_id);

or the basic

CREATE INDEX mytable_ndx ON mytable (enabled, o_dnis);

At this point you should already have good performance.

Otherwise you can run the query twice, either by counting the rows or just priming a cache. Which to choose depends on the data structure and how many rows are returned; usually, the COUNT option is the safest bet.

SELECT COUNT(1) AS n FROM mytable WHERE ...

which gives you n, allowing you to generate a random number k between 0 and n - 1, followed by

SELECT ... FROM mytable LIMIT k, 1

which ought to be really fast. Again, the index will help speed up the counting operation.

In some cases (MySQL only) you could perhaps do better with

SELECT SQL_CACHE SQL_CALC_FOUND_ROWS ... FROM mytable WHERE ...

using the FOUND_ROWS() function to recover n, then running the second query, which should take advantage of the cache. It's best to experiment first, though. And changes in the table demographics might cause performance to degrade.

Upvotes: 0
