Reputation: 1903
Is there some form of hashing algorithm that produces similar numerical values for similar words? I imagine there would be a number of false positives, but it seems like something that could be useful for search pruning.
EDIT: Soundex is neat and may come in handy, but ideally, I want something that behave something like this: abs(f('horse') - f('hoarse')) < abs(f('horse') - f('goat'))
Upvotes: 3
Views: 699
Reputation: 37945
What you are talking about is called Locality-sensitive Hashing. It can be applied to different types of input (images, music, text, positions in space, whatever you need).
Unfortunately (and despite searching) I couldn't find any practical implementation of an LSH algorithm for strings.
Upvotes: 2
Reputation: 3509
Checkout the Soundex algorithm on wikipedia, you haven't specified a language, but there are links to example implementations in multiple languages there. Obviously, this will give you a string hash thats the same for similar sounding words, and you want an integer, but you could then apply the string->integer hashing method they use in Boost.Hash.
Edit: To clarify, here is an example C++ implementation...
#include <boost/foreach.hpp>
#include <boost/functional/hash.hpp>
#include <algorithm>
#include <string>
#include <iostream>
char SoundexChar(char ch)
{
switch (ch)
{
case 'B':
case 'F':
case 'P':
case 'V':
return '1';
case 'C':
case 'G':
case 'J':
case 'K':
case 'Q':
case 'S':
case 'X':
case 'Z':
return '2';
case 'D':
case 'T':
return '3';
case 'M':
case 'N':
return '5';
case 'R':
return '6';
default:
return '.';
}
}
std::size_t SoundexHash(const std::string& word)
{
std::string soundex;
soundex.reserve(word.length());
BOOST_FOREACH(char ch, word)
{
if (std::isalpha(ch))
{
ch = std::toupper(ch);
if (soundex.length() == 0)
{
soundex.append(1, ch);
}
else
{
ch = SoundexChar(ch);
if (soundex.at(soundex.length() - 1) != ch)
{
soundex.append(1, ch);
}
}
}
}
soundex.erase(std::remove(soundex.begin(), soundex.end(), '.'), soundex.end());
if (soundex.length() < 4)
{
soundex.append(4 - soundex.length(), '0');
}
else if (soundex.length() > 4)
{
soundex = soundex.substr(0, 4);
}
return boost::hash_value(soundex);
}
int main()
{
std::cout << "Color = " << SoundexHash("Color") << std::endl;
std::cout << "Colour = " << SoundexHash("Colour") << std::endl;
std::cout << "Gray = " << SoundexHash("Gray") << std::endl;
std::cout << "Grey = " << SoundexHash("Grey") << std::endl;
return 0;
}
Upvotes: 0
Reputation: 31428
You could always try Soundex and see if it fits your needs.
Upvotes: 0
Reputation: 3297
The Soundex algorithm generates strings of keys corresponding to the phonemes in the input word. http://www.archives.gov/research/census/soundex.html
If you only want to compare similarity between strings, try Levenstein Distance. http://en.wikipedia.org/wiki/Levenshtein_distance
Upvotes: 3