Reputation: 25
I have a MySQL table with 267,751 words. I try to find the quickest way to find anagrams, without having to search through the entire table for each search, which would be incredibly inefficient.
For the sake of clarity: An anagram is a word that has been put together by changing the order of the letters of another word.
I came up with a method where I created a new column, where the letters in all the words are sorted alphabetically. Before I do a search, I sort the letters in the search word alphabetically, and search in the new column. This method turns out to be quite fast for exact anagrams (words with the same number of letters).
The problem is not finding exact anagrams. But to find anagrams you can make with one letter less, two letters less, three letters less, and all the way down to two letters. There are suddenly many combinations, and an average search time takes around 0,5 seconds which is bad.
There are many anagram search engines out there, so this should not be difficult, but I cannot come up with an effective way to do it. Does anyone have any ideas? How do they manage to do that so quickly?
Thanks
Upvotes: 2
Views: 804
Reputation: 142383
After pondering the wildcard problem and shortened word problem, here's another partial answer:
Have an INT UNSIGNED
where the bottom 26 bits represent letters. There is no good way to use an INDEX
with such, but it does provide a reasonably fast, space-efficient, way to filter the 267K words.
It does not handle doubled letters (the two Os of BOOK). But it would filter for "BKO" or "BKO*" about equally. Handling the extra O would need app code or one of the extra SQL tricks.
(Rob Ruchte's deleted Answer goes into more details.)
Upvotes: 0
Reputation: 142383
Plan 3 -- Pre-build
The table would have 267K rows. Each row would contain the alphabetized variant of a word, and the all valid words containing those letters:
This approach is a very fast PK lookup:
PK TEXT
art -- art rat tar smart artwork start ...
bkoo -- book books bookmark bookworm ...
opst -- post spot spotty fencepost ...
abbr -- barb barber abbreviate ...
This approach uses FULLTEXT, which is quite fast:
TEXT with FULLTEXT index
art rat tar smart artwork start ...
bkoo book books bookmark bookworm ...
opst post spot spotty fencepost ...
abbr barb barber abbreviate ...
Since you have the lexicon, the problem now is to generate this 2-column table.
If you want to match "x*" and this is Scrabble, then the existing tiles on the board must either be a single letter or a valid word. Example: E_CITE -> EXCITE. By making use of this principle, you can avoid needing strings of 50K words in a row of the table (all words with "E"). (But this is a really messy algorithm. Is it worth it to get the high speed of FULLEXT?)
There are problems with FULLTEXT -- stopwords, short words, automatic tacking on of endings.
Upvotes: 0
Reputation: 98398
Instead of just storing a new column with letters sorted alphabetically, have a new column with a like expression, with the letters sorted alphabetically and a % before and after and between different letters, like:
word | word_like |
---|---|
b | %b% |
bk | %b%k% |
bo | %b%o% |
boko | %b%k%o%o% |
boo | %b%o%o% |
book | %b%k%o%o% |
k | %k% |
kb | %b%k% |
ko | %k%o% |
kob | %b%k%o% |
o | %o% |
ob | %b%o% |
ok | %k%o% |
and use it for your search:
select word from words where 'bkoo' like word_like;
after changing 'book' to have letters sorted alphabetically
Upvotes: 0
Reputation: 142383
The first step is simple and very efficient, as you mentioned.
Build a table with 2 (or more) columns:
word VARCHAR(..),
sorted VARCHAR(..),
PRIMARY KEY(word),
INDEX(sorted)
sorted
is has the letters of word
, but sorted. For example, with 'post':
post -- opst
stop -- opst
pots -- opst
spot -- opst
That is, this will find all the anagrams:
SELECT GROUP_CONCAT(word)
FROM anagrams
WHERE sorted = ?
when you provide the letters sorted.
For 'rat':
art -- art -- Notice that the `word` == `sorted` in one case
rat -- art
tar -- art
The second step is tricker...
Expand that sorted
column into a simple misspelled
column by deleting one letter:
opst -- pst
opst -- ost
opst -- opt
opst -- ops
This is a technique for discovering misspellings of these types:
In this case, you need to say
WHERE misspell IN ('opst', 'pst', 'ost', 'opt', 'ops')
And, of course, INDEX(misspell)
(Details left as an exercise.)
The third step is more of the same -- shorter and shorter strings in the IN
.
(Again, details left as an exercise.)
Upvotes: 2