Reputation: 5444
I want to group similar strings, but I would like the grouping to be smart enough to recognise strings that differ only in separator conventions such as '/' or '-', rather than in their letters.
Given the following input:
moose
mouse
mo/os/e
m.ouse
alpha = ['/','.']
I want to group the strings while ignoring a restricted set of characters, so that the output is:
moose
mo/os/e
mouse
m.ouse
I'm aware I can find similar strings using difflib, but it doesn't provide an option for limiting the alphabet. Is there another way of doing this? Thank you.
Update:
Instead of restricting the letters, it is simpler to check for occurrences of the alpha characters. Therefore, I've changed the title.
Upvotes: 0
Views: 348
Reputation: 54303
Since you want to group words, you should probably use groupby.
You just need to define a function which deletes alpha chars (e.g. with str.translate), and you can apply sort and groupby to your data:
from itertools import groupby
words = ['moose', 'mouse', 'mo/os/e', 'm.ouse']
alpha = ['/','.']
alpha_table = str.maketrans('', '', ''.join(alpha))
def remove_alphas(word):
    return word.lower().translate(alpha_table)
words.sort(key=remove_alphas)
print(words)
# ['moose', 'mo/os/e', 'mouse', 'm.ouse'] # <- Words are sorted correctly.
for common_word, same_words in groupby(words, remove_alphas):
    print(common_word)
    print(list(same_words))
# moose
# ['moose', 'mo/os/e']
# mouse
# ['mouse', 'm.ouse']
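One detail worth spelling out: groupby only merges adjacent items that share a key, which is why the sort comes first. Here is a minimal sketch reusing the names above. The group_words helper and its presorted flag are hypothetical, added only to compare the two cases:

```python
from itertools import groupby

alpha = ['/', '.']
alpha_table = str.maketrans('', '', ''.join(alpha))

def remove_alphas(word):
    # Same key function as above: lowercase and drop the alpha chars.
    return word.lower().translate(alpha_table)

def group_words(words, presorted=False):
    # Hypothetical helper: sorts by the key unless told the input is
    # already sorted, then collects each run of equal keys.
    if not presorted:
        words = sorted(words, key=remove_alphas)
    return [(key, list(group)) for key, group in groupby(words, remove_alphas)]

unsorted_words = ['moose', 'mouse', 'mo/os/e', 'm.ouse']

grouped = group_words(unsorted_words)
# Skipping the sort on unsorted input leaves equal keys in separate runs.
split = group_words(unsorted_words, presorted=True)

print(grouped)
print(split)
```

With the sort, the four words fall into two groups. Without it, each word ends up in its own group because no two equal keys are next to each other.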
Upvotes: 1
Reputation: 1546
Here is an idea that takes a few (easy) steps:
import re
example_strings = ['m/oose', 'moose', 'mouse', 'm.ouse', 'ca...t', 'ca..//t', 'cat']
indexed_strings = list(enumerate(example_strings))
# regex to match restricted alphabet
restricted = re.compile(r'[/.]')
# dictionary to store strings with restricted char
restricted_dict = {}
for (idx, string) in indexed_strings:
    if restricted.search(string):
        # storing the string with a restricted char by its index
        restricted_dict[idx] = string
        # stripping the restricted char temporarily and returning to the list
        indexed_strings[idx] = (idx, restricted.sub('', string))
indexed_strings.sort(key=lambda x: x[1])
# make a new list for the final set of strings
final_strings = []
for (idx, string) in indexed_strings:
    if idx in restricted_dict:
        final_strings.append(restricted_dict[idx])
    else:
        final_strings.append(string)
Result: ['ca...t', 'ca..//t', 'cat', 'm/oose', 'moose', 'mouse', 'm.ouse']
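If the restricted characters come from a list such as the question's alpha, the pattern could be built from that list instead of being hard-coded. A possible sketch, assuming alpha is non-empty (an empty list would yield an invalid pattern); the sort_ignoring name is made up for illustration, and using the stripped string directly as the sort key is a shortcut that replaces the index bookkeeping above:

```python
import re

alpha = ['/', '.']

# re.escape keeps characters such as '.', '-' or ']' literal inside the class.
restricted = re.compile('[' + ''.join(re.escape(ch) for ch in alpha) + ']')

def sort_ignoring(strings):
    # Hypothetical helper: sort by each string with the restricted chars
    # removed. sorted() is stable, so ties keep their input order.
    return sorted(strings, key=lambda s: restricted.sub('', s))

example_strings = ['m/oose', 'moose', 'mouse', 'm.ouse', 'ca...t', 'ca..//t', 'cat']
result = sort_ignoring(example_strings)
print(result)
```

This should give the same ordering as the result shown above, since both approaches sort on the stripped string and keep ties in input order.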
Upvotes: 1
Reputation: 624
Maybe something like:
from collections import defaultdict
container = defaultdict(list)
for word in words:
    container[''.join(item for item in word if item not in alpha)].append(word)
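The snippet relies on words and alpha from the question. A self-contained version might look like the sketch below; storing alpha as a set and pulling the key into its own variable are my assumptions, not part of the original snippet:

```python
from collections import defaultdict

words = ['moose', 'mouse', 'mo/os/e', 'm.ouse']
alpha = {'/', '.'}  # a set instead of the question's list, for quick membership tests

container = defaultdict(list)
for word in words:
    # The key is the word with every alpha character dropped.
    key = ''.join(ch for ch in word if ch not in alpha)
    container[key].append(word)

# Dicts keep insertion order (Python 3.7+), so groups appear in the
# order their key was first seen.
groups = list(container.values())
print(groups)
```

Unlike the groupby approach, this needs no sort beforehand, because words are collected by key wherever they appear in the input.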
Upvotes: 2