rahulthewall

Reputation: 795

SQL Nested Query is slow

I have a database where users are represented by hashes. Each user (hash) has associated values (pertaining to the music track they were listening to). Since a user can listen to more than one track, there are repeat instances of the user and associated data (with the data differing because it now describes a different track).

What I would like to do is select ~10 users at random from this database, and then find their associated data.

Currently, the code that I am using is this:

SELECT *
FROM `tblPlayLogV4`
WHERE `titleId` <> 0 AND `hash` IN (SELECT `hash` FROM `tblPlayLogV4` WHERE `titleId` <> 0 AND RAND() <= 0.1 GROUP BY `hash` HAVING COUNT(`hash`) > 500);

Why RAND()? Because LIMIT is not allowed inside an IN subquery. The idea for using RAND() came from here: http://www.rndblog.com/how-to-select-random-rows-in-mysql/

The above query takes ages to complete.

If, however, I run the inner query separately, it finishes in 4.53 s. I then hard-code the result of the inner query into the outer query, and that finishes in about 275 ms. The separated queries are shown below:

SELECT `hash` FROM `tblPlayLogV4` WHERE `titleId` <> 0 AND RAND() <= 0.1 GROUP BY `hash` HAVING COUNT(`hash`) > 500;

SELECT * FROM `tblPlayLogV4` WHERE `hash` IN ('-29e291921cccd06a5813bca17b7f7c3','-2c08232108dcd93c443d821165c2c79','-58285c1602072da713e51cc6cdc6313','-5bcc2c42482d5805277609a84474aef','-79ecab520d661a1d624de7e7b04f265','-e937c753a96fc9e441f83af97b08489','04d3f1e91e4e92970819190346405d2d','3f9f0cd502de38d47e39367cdfdd6722') AND `titleId`<>0;

Can someone please explain to me why this is happening? What is it that I am doing wrong? And if there is a better way for me to formulate my query, do tell me.
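For reference, this is roughly what I was trying to express. I have read that wrapping the subquery in a derived table works around the LIMIT restriction, though I have not verified whether it is any faster:

-- Untested sketch: MySQL rejects LIMIT directly inside an IN subquery,
-- but accepts it once the subquery is wrapped in a derived table.
SELECT *
FROM `tblPlayLogV4`
WHERE `titleId` <> 0
  AND `hash` IN (SELECT `hash`
                 FROM (SELECT `hash`
                       FROM `tblPlayLogV4`
                       WHERE `titleId` <> 0
                       GROUP BY `hash`
                       HAVING COUNT(`hash`) > 500
                       ORDER BY RAND()
                       LIMIT 10) AS t);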

Number of entries in the database: 6,322,605

Upvotes: 0

Views: 2047

Answers (2)

user359040


As Dems said, your existing query executes the RAND() selection in the sub-query once for every record in your main query's table.

So try rewriting your main query so that the random selection runs only once, in a derived table, like this:

SELECT f.*
FROM (SELECT `hash` 
      FROM `tblPlayLogV4` 
      WHERE `titleId` <> 0 AND RAND() <= 0.1 
      GROUP BY `hash` 
      HAVING COUNT(`hash`) > 500) r
JOIN `tblPlayLogV4` f ON r.`hash` = f.`hash` AND f.`titleId` <> 0;

Upvotes: 1

MatBailie

Reputation: 86775

The difference is that the first query is processing all 6,322,605 records, while the second query is processing only the handful of hashes you hard-coded. So yes, the first query will be slow. Basically you need a way of choosing 10 out of 6 million without processing all 6 million...


The 'simplest' alternative is for each record to be given a "sequence_id" column, and to index that column. You can then just generate 10 random values and directly pull out those records with SELECT * FROM table WHERE sequence_id IN (a,b,c,d,etc). This does, however, require you to ensure that the sequence_id has no gaps.
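A rough, untested sketch of that idea (the column name, index name, and the literal ids are only placeholders):

-- Add and index the sequence column, then back-fill it with 1..N
-- (in insertion order) before relying on it.
ALTER TABLE `tblPlayLogV4` ADD COLUMN `sequence_id` INT NOT NULL;
CREATE INDEX `ix_sequence_id` ON `tblPlayLogV4` (`sequence_id`);

-- Generate 10 random numbers between 1 and MAX(sequence_id) in your
-- application, then fetch those rows directly through the index:
SELECT *
FROM `tblPlayLogV4`
WHERE `sequence_id` IN (101, 20234, 550021, 1200456, 3000789,
                        3456001, 4100200, 5000500, 5900123, 6300001);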

So, do you ever DELETE from that table, or just append to it?

If you do delete from it, you could still ensure 'no-gaps' by taking records from the end of the table, and updating their sequence_id with the values just deleted (possibly with a trigger). The feasibility of this depends on how often you delete from the table.
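Something along these lines, purely as an illustration (note that a MySQL trigger generally can't modify the table it fires on, so in practice this would likely be run from application code after each delete):

SET @deleted_id = 12345;  -- placeholder: the sequence_id that was just deleted

-- Re-point the current highest sequence_id at the freed value so the
-- sequence stays gap-free. The derived table is needed because MySQL
-- won't let an UPDATE read directly from the table it is modifying.
UPDATE `tblPlayLogV4`
SET `sequence_id` = @deleted_id
WHERE `sequence_id` = (SELECT m
                       FROM (SELECT MAX(`sequence_id`) AS m
                             FROM `tblPlayLogV4`) AS x);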

Upvotes: 1
