4532066
4532066

Reputation: 2110

Selecting random words from table

Say I have the following basic MySQL data:

CREATE TABLE my_words (my_word VARCHAR(255));

INSERT INTO my_words VALUES ('dog');
INSERT INTO my_words VALUES ('cat');
INSERT INTO my_words VALUES ('tree');
INSERT INTO my_words VALUES ('ball');
INSERT INTO my_words VALUES ('life');
INSERT INTO my_words VALUES ('complex');
INSERT INTO my_words VALUES ('digeridoo');
INSERT INTO my_words VALUES ('hamster');
INSERT INTO my_words VALUES ('it');
INSERT INTO my_words VALUES ('house');
INSERT INTO my_words VALUES ('love');
INSERT INTO my_words VALUES ('zealous');
INSERT INTO my_words VALUES ('nevis');
INSERT INTO my_words VALUES ('mountain');
INSERT INTO my_words VALUES ('call');
INSERT INTO my_words VALUES ('nail');
INSERT INTO my_words VALUES ('rat');
INSERT INTO my_words VALUES ('hat');

SELECT CONCAT(w1.my_word, w2.my_word) joined
FROM my_words w1, my_words w2
WHERE LENGTH(CONCAT(w1.my_word, w2.my_word)) = 8
ORDER BY RAND() LIMIT 5;

I can write the SQL statement at the end to generate a list of 5 random concatenated strings made up of 2 words, where the total length of the string is 8 characters.

That works fine for a simple table of data like I've got in the sample.

However, the "real" table I am working with contains about 6,200 lines.

If I try the same type of statement it takes 10 seconds to generate the 5 strings.

I guess the SQL is very inefficient as it is searching through the table twice each time and those tables aren't joined in any way.

I wondered if there is a simpler way to extract strings of words made up of 2 words from the table, where the length of the concatenated string is 8 characters long (though that can change - I'm just using 8 as an example).

Thanks


Update 1

Explain Plan of:

EXPLAIN
SELECT CONCAT(w1.fld_un, w2.fld_un) joined
FROM j_un w1
JOIN j_un w2 ON w1.fld_len = 8 - w2.fld_len
WHERE w2.fld_len < 8
ORDER BY RAND()
LIMIT 5;

id  select_type table   type    possible_keys   key     key_len     ref rows    Extra
1   SIMPLE      w2      range   un_len          un_len  5   \N      2694        Using where; Using temporary; Using filesort
1   SIMPLE      w1      ref     un_len          un_len  5   func    527         Using where

Update 2

I'm not sure if it's relevant, but the "fld_un" table has about 6,200 rows.

The "word" is held in the "fld_un" column.

The structure of the table is:

Field       Type            Null    Key     Default     Extra
fld_id      int(11)         NO      PRI     NULL        auto_increment
fld_un      varchar(255)    YES             NULL     
fld_cat_id  int(11)         YES     MUL     NULL     
fld_len     int(2)          NO      MUL     NULL    

These indexes exist on the table:

Keyname     Type    Cardinality Field
PRIMARY     PRIMARY 6318        fld_id
cat         INDEX   15          fld_cat_id
bob         INDEX   11          fld_len 

Does it matter that there is already a primary index on the table? I don't technically need that I suppose.

Statement:

SELECT CONCAT(word1, word2) joined
FROM (
    SELECT w1.fld_un word1, w2.fld_un word2
    FROM j_un2 w1
    JOIN j_un2 w2 ON w1.fld_len = 8 - w2.fld_len
    WHERE w2.fld_len < 8
    ORDER BY RAND()
    LIMIT 5) x;

Query took 23.6805 sec

Explain Plan:

id  select_type     table       type    possible_keys   key     key_len     ref     rows    Extra
1   PRIMARY         <derived2>  ALL     NULL            NULL    NULL        NULL    5    
2   DERIVED         w2          range   bob             bob     4           NULL    4627    Using where; Using temporary; Using filesort
2   DERIVED         w1          ref     bob             bob     4           func    527     Using where

When I revise the "bob" index to include 2 columns as suggested by Thorsten Kettner:

Keyname     Type    Cardinality Field
bob         INDEX   11          fld_len, fld_un

And re-test:

SELECT CONCAT(word1, word2) joined
FROM (
    SELECT w1.fld_un word1, w2.fld_un word2
    FROM j_un2 w1
    JOIN j_un2 w2 ON w1.fld_len = 8 - w2.fld_len
    WHERE w2.fld_len < 8
    ORDER BY RAND()
    LIMIT 5) x;

The query took 30.3394 sec to return 5 rows.

Explain Plan:

id  select_type     table       type    possible_keys   key     key_len     ref     rows    Extra
1   PRIMARY         <derived2>  ALL     NULL            NULL    NULL        NULL    5    
2   DERIVED         w2          range   bob             bob     4           NULL    4211    Using where; Using temporary; Using filesort
2   DERIVED         w1          ref     bob             bob     4           func    527     Using where

Update 3

Ran without "order by rand()" and it ran in 0.0011 sec!

Upvotes: 0

Views: 2043

Answers (2)

Barmar
Barmar

Reputation: 780974

You can add a column, e.g. word_length that contains the length of the word, and add an index on the word_length column. Normally it would be poor design to include data that can be derived from another column, but in this case you need to break purity for performance sake. Then your query can use a JOIN condition using this column:

SELECT CONCAT(w1.my_word, w2.my_word) joined
FROM my_words w1
JOIN my_words w2 ON w1.word_length = 8 - w2.word_length
WHERE w2.word_length < 8
ORDER BY RAND()
LIMIT 5

You can use INSERT and UPDATE triggers to fill in the word_length column automatically.

It might also be helpful to do the concatenation after filtering down to the 5 rows:

SELECT CONCAT(word1, word2) joined
FROM (
    SELECT w1.my_word word1, w2.my_word word2
    FROM my_words w1
    JOIN my_words w2 ON w1.word_length = 8 - w2.word_length
    WHERE w2.word_length < 8
    ORDER BY RAND()
    LIMIT 5) x

Upvotes: 4

Daniel
Daniel

Reputation: 4549

If you can add another column and index to index the length of the word, you can do a join ON clause with the second Word length = desired length - first word length.

I think the real slow part here though is the rand(). You might get better results if you don't do that on the cartesian product, but on each table individually. Although the results are different if you do that...

Upvotes: 0

Related Questions