hurturk
hurturk

Reputation: 5444

Finding similar strings with restricted alpha characters using Python

I want to group similar strings, however, I would prefer to be smart to catch whether conventions like '/' or '-' are diverged instead of letter differences.

Given following input:

moose
mouse
mo/os/e
m.ouse

alpha = ['/','.']

I want to group strings with respect to restricted set of letters, where output should be:

moose
mo/os/e

mouse
m.ouse

I'm aware I can get similar strings using difflib but it doesn't provide option for limiting the alphabet. Is there another way of doing this? Thank you.

Update:

Instead of restricted letters, alphas are simpler to implement by just checking for occurrences. Therefore, I've changed the title.

Upvotes: 0

Views: 348

Answers (3)

Eric Duminil
Eric Duminil

Reputation: 54303

Since you want to group words, you should probably use groupby.

You just need to define a function which deletes alpha chars (e.g. with str.translate), and you can apply sort and groupby to your data:

from itertools import groupby

words = ['moose', 'mouse', 'mo/os/e', 'm.ouse']
alpha = ['/','.']

alpha_table = str.maketrans('', '', ''.join(alpha))

def remove_alphas(word):
    return word.lower().translate(alpha_table)

words.sort(key=remove_alphas)
print(words)
# ['moose', 'mo/os/e', 'mouse', 'm.ouse'] # <- Words are sorted correctly.

for common_word, same_words in groupby(words, remove_alphas):
    print(common_word)
    print(list(same_words))
# moose
# ['moose', 'mo/os/e']
# mouse
# ['mouse', 'm.ouse']

Upvotes: 1

isosceleswheel
isosceleswheel

Reputation: 1546

Here is an idea that takes a few (easy) steps:

import re
example_strings = ['m/oose', 'moose', 'mouse', 'm.ouse', 'ca...t', 'ca..//t', 'cat']

1. Index all of your strings so you can refer back to them by index later:

indexed_strings = list(enumerate(example_strings))

2. Store all strings with restricted characters in a dictionary using index as the key, string as the value. Then remove the restricted chars temporarily for sorting:

# regex to match restricted alphabet
restricted = re.compile('[/\.]')
# dictionary to store strings with restricted char
restricted_dict = {}
for (idx, string) in indexed_strings:
    if restricted.search(string):
        # storing the string with a restricted char by its index
        restricted_dict[idx] = string
        # stripping the restricted char temporarily and returning to the list
        indexed_strings[idx] = (idx, restricted.sub('', string))

3. Sort the cleaned list of strings by string values, then iterate over the strings once more and replace the stripped strings with their original values:

indexed_strings.sort(key=lambda x: x[1])
# make a new list for the final set of strings
final_strings = []
for (idx, string) in indexed_strings:
    if idx in restricted_dict:
        final_strings.append(restricted_dict[idx])
    else:
        final_strings.append(string)

Result: ['ca...t', 'ca..//t', 'cat', 'm/oose', 'moose', 'mouse', 'm.ouse']

Upvotes: 1

hello world
hello world

Reputation: 624

Maybe something like:

from collections import defaultdict

container = defaultdict(list)
for word in words:
    container[''.join(item for item in word if item not in alpha)].append(word)

Upvotes: 2

Related Questions