Reputation: 25

What is the fastest way to search for anagrams?

I have a MySQL table with 267,751 words. I try to find the quickest way to find anagrams, without having to search through the entire table for each search, which would be incredibly inefficient.

For the sake of clarity: An anagram is a word that has been put together by changing the order of the letters of another word.

I came up with a method where I created a new column, where the letters in all the words are sorted alphabetically. Before I do a search, I sort the letters in the search word alphabetically, and search in the new column. This method turns out to be quite fast for exact anagrams (words with the same number of letters).

The problem is not finding exact anagrams. But to find anagrams you can make with one letter less, two letters less, three letters less, and all the way down to two letters. There are suddenly many combinations, and an average search time takes around 0,5 seconds which is bad.

There are many anagram search engines out there, so this should not be difficult, but I cannot come up with an effective way to do it. Does anyone have any ideas? How do they manage to do that so quickly?

Thanks

Upvotes: 2

Answers (4)

Rick James

Reputation: 142383

After pondering the wildcard problem and shortened word problem, here's another partial answer:

Have an INT UNSIGNED where the bottom 26 bits represent letters. There is no good way to use an INDEX with such, but it does provide a reasonably fast, space-efficient, way to filter the 267K words.

It does not handle doubled letters (the two Os of BOOK). But it would filter for "BKO" or "BKO*" about equally. Handling the extra O would need app code or one of the extra SQL tricks.

(Rob Ruchte's deleted Answer goes into more details.)

Upvotes: 0

Rick James

Reputation: 142383

Plan 3 -- Pre-build

The table would have 267K rows. Each row would contain the alphabetized variant of a word, and the all valid words containing those letters:

This approach is a very fast PK lookup:

PK      TEXT
art  -- art rat tar smart artwork start ...
bkoo -- book books bookmark bookworm ...
opst -- post spot spotty fencepost ...
abbr -- barb barber abbreviate ...

This approach uses FULLTEXT, which is quite fast:

TEXT with FULLTEXT index
art rat tar smart artwork start ...
bkoo book books bookmark bookworm ...
opst post spot spotty fencepost ...
abbr barb barber abbreviate ...

Since you have the lexicon, the problem now is to generate this 2-column table.

If you want to match "x*" and this is Scrabble, then the existing tiles on the board must either be a single letter or a valid word. Example: E_CITE -> EXCITE. By making use of this principle, you can avoid needing strings of 50K words in a row of the table (all words with "E"). (But this is a really messy algorithm. Is it worth it to get the high speed of FULLEXT?)

There are problems with FULLTEXT -- stopwords, short words, automatic tacking on of endings.

Upvotes: 0

ysth

Reputation: 98398

Instead of just storing a new column with letters sorted alphabetically, have a new column with a like expression, with the letters sorted alphabetically and a % before and after and between different letters, like:

word	word_like
b	%b%
bk	%b%k%
bo	%b%o%
boko	%b%k%o%o%
boo	%b%o%o%
book	%b%k%o%o%
k	%k%
kb	%b%k%
ko	%k%o%
kob	%b%k%o%
o	%o%
ob	%b%o%
ok	%k%o%

and use it for your search:

select word from words where 'bkoo' like word_like;

after changing 'book' to have letters sorted alphabetically

Upvotes: 0

Rick James

Reputation: 142383

The first step is simple and very efficient, as you mentioned.

Build a table with 2 (or more) columns:

word VARCHAR(..),
sorted VARCHAR(..),
PRIMARY KEY(word),
INDEX(sorted)

sorted is has the letters of word, but sorted. For example, with 'post':

post -- opst
stop -- opst
pots -- opst
spot -- opst

That is, this will find all the anagrams:

SELECT GROUP_CONCAT(word) 
    FROM anagrams
    WHERE sorted = ?

when you provide the letters sorted.

For 'rat':

art -- art  -- Notice that the `word` == `sorted` in one case
rat -- art
tar -- art

The second step is tricker...

Expand that sorted column into a simple misspelled column by deleting one letter:

opst -- pst
opst -- ost
opst -- opt
opst -- ops

This is a technique for discovering misspellings of these types:

One letter dropped
One letter added
Adjacent pair of letters transposed

In this case, you need to say

WHERE misspell IN ('opst', 'pst', 'ost', 'opt', 'ops')

And, of course, INDEX(misspell)

(Details left as an exercise.)

The third step is more of the same -- shorter and shorter strings in the IN.

(Again, details left as an exercise.)

Upvotes: 2

What is the fastest way to search for anagrams?

Answers (4)

Related Questions