Reputation: 233
I have an English dictionary in a MySQL database with just over 250K entries, and I'm using a simple ruby front-end to search it using wildcards at the beginning of the strings. So far I've been doing it like this:
SELECT * FROM words WHERE word LIKE '_e__o'
or even
SELECT * FROM words WHERE word LIKE '____s'
I always know the exact length of the word, but all but a single character are potentially unknown.
This is slower than molasses, about fifteen times slower than a similar query without the leading wildcard because the index for the column cannot be used.
I've tried a few methods to narrow the scope of the search. For example, I've added 26 additional columns containing each word's individual letter counts and narrowed the search using those first. I've also tried narrowing by word length. These methods made almost no difference, thanks to the inherent inefficiency of leading-wildcard searches. I've experimented with the REGEXP operator, which is even slower.
SQLite and PostgreSQL are just as limited as MySQL, and though I have limited experience with NoSQL systems, my research gives me the impression that they excel at scalability, not performance of the kind I need.
My question then, is where should I look for a solution? Should I continue trying to find a way to optimize my queries or add supplementary columns that can narrow my potential recordset? Are there systems designed specifically to accomplish fast wildcard searching in this vein?
Upvotes: 11
Views: 2880
Reputation: 54984
A quick way to get it down by a factor of 10 or so is to create a column for string length, put an index on it, and use it in the where clause.
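A sketch of what that could look like (the len column and index names here are just examples):

-- hypothetical length column; names are illustrative
ALTER TABLE words ADD COLUMN len TINYINT UNSIGNED NOT NULL DEFAULT 0;
UPDATE words SET len = CHAR_LENGTH(word);
CREATE INDEX words_len_idx ON words (len);

-- the original query, narrowed by length first
SELECT * FROM words WHERE len = 5 AND word LIKE '_e__o';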
Upvotes: 1
Reputation: 1299
You can index this query fully without having to scan any more than the size of the result set, which is optimal.
Create a lookup table like so:
Table: lookup

pattern    word_id
-------    -------
_o_s_      1
_ous_      1
...
Which references your word table:
Table: word

word_id    word
-------    -----
1          mouse
Put an index on pattern and execute a select like so:
select w.word
from lookup l, word w
where l.pattern = '_ous_'
  and l.word_id = w.word_id;
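For reference, the lookup table itself might be declared like this (a sketch; the varchar length and index name are illustrative):

create table lookup (
  pattern varchar(64) not null,   -- illustrative max length
  word_id int not null,
  index pattern_idx (pattern)
);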
Of course you'll need a little Ruby script to create this lookup table, where pattern is every possible pattern for every word in the dictionary. In other words, the patterns for mouse would be:
m____
mo___
mou__
mous_
mouse
_o___
_ou__
...
The Ruby to generate all patterns for a given word could look like:

# Recursively build every pattern by either keeping or blanking the first
# letter, then combining it with each pattern of the remaining letters.
def generate_patterns word
  return [word, '_'] if word.size == 1
  generate_patterns(word[1..-1]).map do |sub_word|
    [word[0] + sub_word, '_' + sub_word]
  end.flatten
end
For example:
> generate_patterns 'mouse'
mouse
_ouse
m_use
__use
mo_se
_o_se
m__se
___se
mou_e
_ou_e
m_u_e
__u_e
mo__e
_o__e
m___e
____e
mous_
_ous_
m_us_
__us_
mo_s_
_o_s_
m__s_
___s_
mou__
_ou__
m_u__
__u__
mo___
_o___
m____
_____
Upvotes: 1
Reputation: 115530
This is more of an exercise than a real-life solution. The idea is to split the words into characters.
Let's design the needed table first. I assume your words table has the columns word_id, word, size:
CREATE TABLE letter_search
( word_id INT NOT NULL
, position TINYINT UNSIGNED NOT NULL
, letter CHAR(1) NOT NULL
, PRIMARY KEY (word_id, position)
, FOREIGN KEY (word_id)
    REFERENCES words (word_id)
    ON DELETE CASCADE
    ON UPDATE CASCADE
, INDEX position_letter_idx (position, letter)
, INDEX letter_idx (letter)
) ENGINE = InnoDB ;
We'll need an auxiliary "numbers" table:

CREATE TABLE num
( i TINYINT UNSIGNED NOT NULL
, PRIMARY KEY (i)
) ;

INSERT INTO num (i)      -- I suppose you don't have
VALUES                   -- words with 100 letters
(1), (2), ..., (100) ;
To populate our letter_search table:

INSERT INTO letter_search
  ( word_id, position, letter )
SELECT
    w.word_id
  , num.i
  , SUBSTRING( w.word, num.i, 1 )
FROM
    words AS w
  JOIN
    num ON num.i <= w.size ;
The size of this search table will be about 10 * 250K rows, where 10 stands for the average length of your words.
Finally, the query:
SELECT * FROM words WHERE word LIKE '_e__o'
will be written as:
SELECT w.*
FROM
    words AS w
  JOIN
    letter_search AS s2
      ON (s2.position, s2.letter, s2.word_id) = (2, 'e', w.word_id)
  JOIN
    letter_search AS s5
      ON (s5.position, s5.letter, s5.word_id) = (5, 'o', w.word_id)
WHERE
    w.size = 5 ;
Upvotes: 0
Reputation:
With PostgreSQL 9.1 and the pg_trgm extension you can create indexes that are usable for the kind of LIKE condition you are describing.
For an example see here: http://www.depesz.com/2011/02/19/waiting-for-9-1-faster-likeilike/
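The setup boils down to something like this (a sketch; the index name is illustrative):

CREATE EXTENSION pg_trgm;

-- a GIN trigram index on the word column (a GiST index would also work)
CREATE INDEX words_word_trgm_idx ON words USING gin (word gin_trgm_ops);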
I verified it on a table with 300k rows using LIKE '____1' and it does use such an index. It took about 120ms to count the number of rows in that table (on an old laptop). Interestingly enough, the expression LIKE 'd___1' is not faster; it's about the same speed.
It also depends on the number of characters in the search term: the longer it gets, the slower it will be, as far as I can tell.
You would need to check with your data if the performance is acceptable.
Upvotes: 5
Reputation: 118611
It all comes down to indexing.
You can create tables like:
create table letter_index (
  id integer not null primary key,
  letter varchar(1),
  position integer
);

create unique index letter_index_i1 on letter_index (letter, position);

create table letter_index_word (
  letter_index_id integer,
  word_id integer
);
Then index all of your words.
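Populating those tables could itself be done in SQL, roughly like this (a sketch, assuming a words(id, word) table, an auxiliary num(i) table holding positions 1..N, and an AUTO_INCREMENT id on letter_index; all of these names are assumptions):

-- one row per distinct (letter, position) pair that actually occurs
-- (assumes letter_index.id is AUTO_INCREMENT or otherwise pre-assigned)
insert into letter_index (letter, position)
select distinct substring(w.word, n.i, 1), n.i
from words w
join num n on n.i <= char_length(w.word);

-- one link row per word / letter / position
insert into letter_index_word (letter_index_id, word_id)
select li.id, w.id
from words w
join num n on n.i <= char_length(w.word)
join letter_index li
  on li.letter = substring(w.word, n.i, 1)
 and li.position = n.i;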
If you want a list of all words with an 'e' in the 2nd position:
select words.* from words, letter_index_word liw, letter_index li
where li.letter = 'e' and li.position = 2
and liw.letter_index_id = li.id
and words.id = liw.word_id
If you want all of the words with 'e' in the 2nd position, and 's' in the fifth position:
select words.* from words, letter_index_word liw, letter_index li
where li.letter = 'e' and li.position = 2
and liw.letter_index_id = li.id
and words.id = liw.word_id
and words.id in (
select liw.word_id from letter_index_word liw, letter_index li
where li.letter = 's' and li.position = 5
and liw.letter_index_id = li.id
)
Or you can run two simple queries and merge the results yourself.
Of course, simply caching the list in memory and iterating through it is likely faster than any of these, though probably not by enough to justify loading the 250K-word list from the DB every time.
Upvotes: 1
Reputation: 24281
I'm assuming that the time initially taken to insert the words and set up your indexing is inconsequential. Also, you would not update the word list very often, so it's basically static data.
You could try an approach like this:

If your RDBMS allows it, you would be better off with a single table partitioned by word length (sketched below).
If that is still not fast enough, you could further split it by length and known letter. For example, you could have a table listing all 8 letter words containing a "Z".
When you query, you know you have an 8 letter word containing an "E" and a "Z". First query the data dictionary to see which letter is rarest in 8 letter words, and then scan that table. By "query the data dictionary", I mean figure out whether table words_8E or table words_8z has the fewest records.
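For the partitioned-table variant suggested above, the DDL might look roughly like this (a sketch; MySQL requires the partitioning column to be part of every unique key, hence the composite primary key, and all names and ranges are illustrative):

CREATE TABLE words_partitioned (
  word_id INT NOT NULL,
  word    VARCHAR(64) NOT NULL,
  size    TINYINT UNSIGNED NOT NULL,   -- word length
  PRIMARY KEY (word_id, size)
) ENGINE = InnoDB
PARTITION BY LIST (size) (
  PARTITION p01_05 VALUES IN (1, 2, 3, 4, 5),
  PARTITION p06_10 VALUES IN (6, 7, 8, 9, 10),
  PARTITION p11_20 VALUES IN (11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
);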
Regarding Normal Forms and Good Practice
This is not the sort of thing I would typically recommend when modelling data. In your particular case, storing the entire word in a single character column is not actually in 1st normal form, because you care about individual elements within the word. Given your use case, a word is a list of letters rather than a single word. As always, how to model depends on what you care about.

Your queries are giving you trouble because the data is not in first normal form.
The fully normalised model for this problem would have two tables: word (WordId PK) and WordLetter (WordId PK, Position PK, Letter). You would then query for all words using multiple WHERE EXISTS clauses, one checking for a letter in each known position.
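A sketch of such a query against those two tables, assuming the word table also keeps the full word text and a Size column for convenience:

SELECT w.*
FROM word w
WHERE w.Size = 5          -- assumed length column on word
  AND EXISTS (SELECT 1 FROM WordLetter wl
              WHERE wl.WordId = w.WordId AND wl.Position = 2 AND wl.Letter = 'e')
  AND EXISTS (SELECT 1 FROM WordLetter wl
              WHERE wl.WordId = w.WordId AND wl.Position = 5 AND wl.Letter = 'o');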
While correct according to database theory, I do not think this will perform well.
Upvotes: 1
Reputation: 4427
Create an in-memory lookup table solution: you could have a sorted table for each word length.

Then to match, say you know the 4th and 8th letters: loop through the words, checking only the 4th letter of each. They are all the same length, so that will be quick. Only if that letter matches do you check the 8th letter.

It's brute force, but it will be fast. Let's say worst case you have 50,000 8-letter words; that's 50,000 comparisons. Even allowing for Ruby's runtime performance, it should still take under a second.

The memory required would be about 250K x 10 bytes, so 2.5 MB.
Upvotes: 0
Reputation: 13097
You can try using Apache Lucene, a full text search engine. It was made to answer queries like this, so you might have more luck.
Wildcard searching with Lucene.
Upvotes: 0