Reputation: 41
I'm using the symspellpy module in Python for query correction. It is really useful and fast, but I'm having an issue with it.
Is there a way to force SymSpell to return more than one suggestion for a correction? I need that so my application can analyse the candidates and pick the best one.
I'm calling Symspell like this:
suggestions = sym_spell.lookup(query, VERBOSITY_ALL, max_edit_distance=3)
Example of what I'm trying to do:
query = "resende"
What I want returned: ["resende", "rezende"]
What the method actually returns: ["resende"]
Note that both "resende" and "rezende" are in my dictionary.
Upvotes: 1
Views: 1258
Reputation: 1022
Merely a typo. Change
VERBOSITY_ALL
... to
Verbosity.ALL
(importing Verbosity from symspellpy).
The three options are CLOSEST, TOP, and ALL.
A couple of other things in SymSpell, described in the symspellpy documentation:
Supported edit distance algorithm choices.
LEVENSHTEIN = 0 Levenshtein algorithm
DAMERAU_OSA = 1 Damerau optimal string alignment algorithm (default)
LEVENSHTEIN_FAST = 2 Fast Levenshtein algorithm
DAMERAU_OSA_FAST = 3 Fast Damerau optimal string alignment algorithm
With either algorithm, the lowest edit distance (fewest changes needed) wins; when distances are tied (noticeable with Verbosity.ALL), the suggestion with the higher count/frequency is ranked first.
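To see concretely why the algorithm choice matters, here is a minimal pure-Python sketch of both distances (illustrative only, not symspellpy's implementation): Damerau-OSA counts an adjacent transposition as one edit, while plain Levenshtein needs two.

```python
def levenshtein(a, b):
    """Classic Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def damerau_osa(a, b):
    """Optimal string alignment: Levenshtein plus adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

print(levenshtein("resende", "resedne"))   # 2: two substitutions
print(damerau_osa("resende", "resedne"))   # 1: one swap of adjacent letters
```

So a query with a swapped pair of letters can fall inside max_edit_distance under DAMERAU_OSA but outside it under LEVENSHTEIN.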
To change from the default, overwrite it with one of them:
from symspellpy import Verbosity
from symspellpy.editdistance import DistanceAlgorithm

sym_spell._distance_algorithm = DistanceAlgorithm.LEVENSHTEIN  # private attribute

word = 'something'
matches = sym_spell.lookup(word, Verbosity.ALL, max_edit_distance=2)
for match in matches:  # each match carries term, distance, count
    print(f'{word} -> {match.term} {match.distance} {match.count}')
SymSpell can currently (Apr 2022) only load its dictionary of valid words from a file. However, a method like this can be added inside symspellpy.py so it can read from a collections.Counter (or any other {word: count} mapping) — a mere quick hack that works for my purposes:
def load_Counter_dictionary(self, counts_each):
    for key, count in counts_each.items():
        self.create_dictionary_entry(key, count)
You can then drop the use of load_dictionary() in favour of something like this:
sym_spell.load_Counter_dictionary(Counter(words_list))
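The helper above only wraps SymSpell's public create_dictionary_entry, so the pattern can be sketched against a tiny stand-in class (a hypothetical TinySpell, not symspellpy itself) to show exactly what the loader does:

```python
from collections import Counter

class TinySpell:
    """Hypothetical stand-in for SymSpell, holding only term counts."""
    def __init__(self):
        self.words = {}

    def create_dictionary_entry(self, key, count):
        # mirrors SymSpell's signature: register a term with its frequency
        self.words[key] = self.words.get(key, 0) + count

    def load_Counter_dictionary(self, counts_each):
        # the same hack: feed a Counter straight into the dictionary
        for key, count in counts_each.items():
            self.create_dictionary_entry(key, count)

speller = TinySpell()
speller.load_Counter_dictionary(Counter(["resende", "resende", "rezende"]))
print(speller.words)  # {'resende': 2, 'rezende': 1}
```

The Counter collapses the word list into word: count pairs, which is exactly the shape create_dictionary_entry expects.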
The reason I resorted to this: a million-plus-record CSV file was already loaded into a pandas DataFrame containing a column of codes (think words) — some occurring in large numbers (likely correct) alongside outliers to be corrected — plus a column already holding their counts. Rather than saving the counts dict to a file (expensive) only for SymSpell to reload it, this approach is direct and efficient.
Upvotes: 1