Daniel

Reputation: 233

Fast(er) method for wildcard searching of 250K+ strings

I have an English dictionary in a MySQL database with just over 250K entries, and I'm using a simple ruby front-end to search it using wildcards at the beginning of the strings. So far I've been doing it like this:

SELECT * FROM words WHERE word LIKE '_e__o'

or even

SELECT * FROM words WHERE word LIKE '____s'

I always know the exact length of the word, but all but a single character are potentially unknown.

This is slower than molasses, about fifteen times slower than a similar query without the leading wildcard because the index for the column cannot be used.

I've tried a few methods to narrow the scope of the search. For example, I've added 26 additional columns containing each word's individual letter counts and narrowed the search using those first. I've also tried narrowing by word length. These methods made almost no difference, thanks to the inherent inefficiency of leading-wildcard searches. I've also experimented with the REGEXP operator, which is even slower.
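For reference, the narrowing looked roughly like this (the column names here are just illustrative):

SELECT *
FROM words
WHERE len = 5
  AND count_s >= 1
  AND word LIKE '____s'

Even with the candidate set narrowed that way, the leading-wildcard LIKE still dominates the cost.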

SQLite and PostgreSQL are just as limited as MySQL, and though I have limited experience with NoSQL systems, my research gives me the impression that they excel at scalability, not performance of the kind I need.

My question, then, is: where should I look for a solution? Should I continue trying to find a way to optimize my queries or add supplementary columns that can narrow my potential recordset? Are there systems designed specifically to accomplish fast wildcard searching in this vein?

Upvotes: 11

Views: 2880

Answers (8)

pguardiario

Reputation: 54984

A quick way to get it down by a factor of 10 or so is to create a column for string length, put an index on it, and use it in the where clause.
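Something along these lines (column and index names are just placeholders):

ALTER TABLE words ADD COLUMN len SMALLINT UNSIGNED;
UPDATE words SET len = CHAR_LENGTH(word);
CREATE INDEX idx_words_len ON words (len);

-- the index narrows the scan to words of the right length before the LIKE runs
SELECT * FROM words WHERE len = 5 AND word LIKE '_e__o';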

Upvotes: 1

Dark Castle

Reputation: 1299

You can index this query fully, without having to scan any more rows than the size of the result set, which is optimal.

Create a lookup table like so:

Table:  lookup
pattern     word_id
_o_s_       1
_ous_       1
...

Which references your word table:

Table:  word
word_id     word
1           mouse

Put an index on pattern and execute a select like so:

select w.word
from lookup l, word w
where l.pattern = '_ous_' and
l.word_id = w.word_id;

Of course you'll need a little Ruby script to create this lookup table, where pattern is every possible pattern for every word in the dictionary. In other words, the patterns for mouse would be:

m____
mo___
mou__
mous_
mouse
_o___
_ou__
...

The Ruby to generate all patterns for a given word could look like:

# Recursively build every pattern for a word by either keeping or masking
# ('_') each letter, which yields 2^n patterns for an n-letter word.
def generate_patterns word
  return [word, '_'] if word.size == 1
  generate_patterns(word[1..-1]).map do |sub_word|
    [word[0] + sub_word, '_' + sub_word]
  end.flatten
end

For example:

> generate_patterns 'mouse'
mouse
_ouse
m_use
__use
mo_se
_o_se
m__se
___se
mou_e
_ou_e
m_u_e
__u_e
mo__e
_o__e
m___e
____e
mous_
_ous_
m_us_
__us_
mo_s_
_o_s_
m__s_
___s_
mou__
_ou__
m_u__
__u__
mo___
_o___
m____
_____

Upvotes: 1

ypercubeᵀᴹ

Reputation: 115530

This is more of an exercise than a real-life solution. The idea is to split the words into characters.

Let's design the needed table first. I assume your words table has the columns word_id, word, size:

CREATE TABLE letter_search
( word_id INT NOT NULL
, position TINYINT UNSIGNED NOT NULL
, letter CHAR(1) NOT NULL
, PRIMARY KEY (word_id, position)
, FOREIGN KEY (word_id)
    REFERENCES words (word_id)
      ON DELETE CASCADE 
      ON UPDATE CASCADE
, INDEX position_letter_idx (position, letter)
, INDEX letter_idx (letter)
) ENGINE = InnoDB ;

We'll need an auxiliary "numbers" table:

CREATE TABLE num
( i TINYINT UNSIGNED NOT NULL
, PRIMARY KEY (i)
) ;

INSERT INTO num (i)               --- I suppose you don't have
VALUES                            --- words with 100 letters
  (1), (2), ..., (100) ;

To populate our letter_search table:

INSERT INTO letter_search
  ( word_id, position, letter )
SELECT
    w.word_id
  , num.i
  , SUBSTRING( w.word, num.i, 1 ) 
FROM 
    words AS w
  JOIN
    num
       ON num.i <= w.size

The size of this search table will be about 10 * 250K rows (where 10 is the average length of your words).


Finally, the query:

SELECT * FROM words WHERE word LIKE '_e__o'

will be written as:

SELECT w.* 
FROM 
    words AS w
  JOIN
    letter_search AS s2
        ON (s2.position, s2.letter, s2.word_id) = (2, 'e', w.word_id)
  JOIN
    letter_search AS s5
        ON (s5.position, s5.letter, s5.word_id) = (5, 'o', w.word_id)
WHERE
    w.size = 5

Upvotes: 0

user330315

Reputation:

With PostgreSQL 9.1 and the pg_trgm extension, you can create indexes that are usable for a LIKE condition like the one you are describing.

For an example see here: http://www.depesz.com/2011/02/19/waiting-for-9-1-faster-likeilike/
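The setup is roughly this (assuming a words table with a word column, as in the question):

CREATE EXTENSION pg_trgm;

CREATE INDEX idx_words_word_trgm ON words USING gin (word gin_trgm_ops);

-- the planner can then use the trigram index for this kind of pattern
SELECT count(*) FROM words WHERE word LIKE '____1';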

I verified it on a table with 300k rows using LIKE '____1' and it does use such an index. It took about 120ms to count the number of rows in that table (on an old laptop). Interestingly enough, the expression LIKE 'd___1' is not faster; it's about the same speed.

It also depends on the number of characters in the search term: the longer it gets, the slower it will be, as far as I can tell.

You would need to check with your data if the performance is acceptable.

Upvotes: 5

Will Hartung

Reputation: 118611

It all comes down to indexing.

You can create tables like:

create table letter_index (
    id integer not null primary key,
    letter varchar(1),
    position integer
)

create unique index letter_index_i1 on letter_index (letter, position)

create table letter_index_word (
    letter_index_id integer,
    word_id integer
)

Then index all of your words.

If you want a list of all words with an 'e' in the 2nd position:

select words.* from words, letter_index_word liw, letter_index li
where li.letter = 'e' and li.position = 2
and liw.letter_index_id = li.id
and words.id = liw.word_id

If you want all of the words with 'e' in the 2nd position, and 's' in the fifth position:

select words.* from words, letter_index_word liw, letter_index li
where li.letter = 'e' and li.position = 2
and liw.letter_index_id = li.id
and words.id = liw.word_id
and words.id in (
    select liw.word_id from letter_index_word liw, letter_index li
    where li.letter = 's' and li.position = 5
    and liw.letter_index_id = li.id
)

Or you can run two simple queries and merge the results yourself.

Of course, simply caching and iterating through the list in memory is likely faster than any of these. But not fast enough to be worth loading the 250K list from the DB every time.

Upvotes: 1

WW.

Reputation: 24281

I'm assuming that the time initially taken to insert the words and set up your indexing is inconsequential. Also, you would not update the word list very often, so it's basically static data.

You could try an approach like this:-

  • Since you always know the word length, create a table containing all the words of length 1, another table of words length 2, etc.
  • When you conduct a query, select out of the appropriate table based on word length. It will still need to do a full scan of that table.

If your RDBMS allows it, you would be better off with a single table partitioned by word length.
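A rough sketch of the partitioned variant in MySQL (the schema here is hypothetical; the partition key has to be part of every unique key, hence the composite primary key):

CREATE TABLE words (
    word_id INT NOT NULL,
    size TINYINT UNSIGNED NOT NULL,
    word VARCHAR(64) NOT NULL,
    PRIMARY KEY (word_id, size)
)
PARTITION BY HASH (size)
PARTITIONS 20;

-- WHERE size = 5 lets the optimizer prune the scan to a single partition
SELECT word FROM words WHERE size = 5 AND word LIKE '_e__o';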

If that is still not fast enough, you could further split it by length and known letter. For example, you could have a table listing all 8 letter words containing a "Z".

When you query, you know you have an 8-letter word containing an "E" and a "Z". First query the data dictionary to see which letter is rarest in 8-letter words, and then scan that table. By querying the data dictionary, I mean figuring out whether table words_8e or table words_8z has fewer records.
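One way to do that lookup, using the (approximate) row counts MySQL keeps in its data dictionary (words_8e and words_8z being the per-letter tables described above):

SELECT table_name, table_rows
FROM information_schema.tables
WHERE table_schema = DATABASE()
  AND table_name IN ('words_8e', 'words_8z')
ORDER BY table_rows
LIMIT 1;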

Regarding Normal Forms and Good Practice

This is not the sort of thing I would typically recommend when modelling data. In your particular case, storing the entire word in a single character column is not actually in 1st normal form, because you care about individual elements within the word. Given your use case, a word is more a list of letters than a single word. As always, how to model depends on what you care about.

Your queries are giving you trouble because the data is not in first normal form.

The fully normalised model for this problem would have two tables: word (WordId PK) and WordLetter (WordId PK, Position PK, Letter). You would then query for all words using multiple WHERE EXISTS clauses, one checking for a letter in each known position.
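A sketch of that query, for a word with 'e' in position 2 and 'o' in position 5 (assuming the word table also carries the word text for output):

SELECT w.WordId, w.Word
FROM word w
WHERE EXISTS (SELECT 1 FROM WordLetter wl
              WHERE wl.WordId = w.WordId AND wl.Position = 2 AND wl.Letter = 'e')
  AND EXISTS (SELECT 1 FROM WordLetter wl
              WHERE wl.WordId = w.WordId AND wl.Position = 5 AND wl.Letter = 'o');
-- plus a length restriction, however you choose to store it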

While correct according to database theory, I do not think this will perform well.

Upvotes: 1

peterept

Reputation: 4427

Create an in-memory lookup table solution: you could have a sorted table for each word length.

Then, to match, say you know the 4th and 8th letters: loop through the words, checking only each word's 4th letter. They are all the same length, so that will be quick. Only if that letter matches do you check the 8th letter.

It's brute force, but it will be fast. Let's say, worst case, you have 50,000 8-letter words. That's 50,000 comparisons. Even allowing for Ruby runtime performance issues, it should still be < 1 sec.

Memory required would be about 250K x 10 bytes, so roughly 2.5 MB.
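A minimal Ruby sketch of the idea (the word list source here is assumed):

# group the words by length once, up front
words_by_length = Hash.new { |h, k| h[k] = [] }
File.foreach('words.txt') { |line| w = line.chomp; words_by_length[w.size] << w }

# find all words of a given length with the given letters at the given
# (1-based) positions, e.g. match(words_by_length, 8, 4 => 'e', 8 => 's')
def match(words_by_length, length, known)
  words_by_length[length].select do |w|
    known.all? { |pos, letter| w[pos - 1] == letter }
  end
end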

Upvotes: 0

Oleksi

Reputation: 13097

You can try using Apache Lucene, a full-text search engine. It was made to answer queries like this, so you might have more luck.

Wildcard searching with lucene.

Upvotes: 0
