Reputation: 1093
I'm dealing with a list of strings that may contain extra repeated letters compared with their correct spelling, for example:
words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday']
I want to pre-process these strings so that they are spelt correctly, producing a new list:
cleaned_words = ['why', 'hey', 'alright', 'cool', 'monday']
The length of the run of duplicated letters can vary; however, cool should obviously keep its spelling.
I'm unaware of any Python libraries that do this, and I'd prefer to avoid hard-coding it.
I've tried this: http://norvig.com/spell-correct.html but the more words I put in the text file, the more likely it seems to suggest an incorrect spelling, so it never actually gets it right, even once the additional letters have been removed. For example, eel becomes teel ...
Thanks in advance.
Upvotes: 1
Views: 889
Reputation: 3495
If you were to download a text file of all English words to check against, this is another way that could work.
I've not tested it but you get the idea. It iterates through the letters, and if the current letter matches the last one, it'll remove the letter from the word. If it narrows down those letters to 1, and there is still no valid word, it'll reset the word back to normal and continue until the next duplicate characters are found.
import urllib.request

words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday']
url = 'https://raw.githubusercontent.com/eneko/data-repository/master/data/words.txt'
word_list = set(w.strip().lower() for w in urllib.request.urlopen(url).read().decode('utf-8').splitlines())
found_words = []
for word in (w.lower() for w in words):
    # Keep the word as it is if it is already valid
    if word in word_list:
        found_words.append(word)
        continue
    i = 1
    removed = 0
    current_word = word
    while i < len(current_word):
        if current_word[i] == current_word[i - 1]:
            # Duplicate character: remove it and test the shortened word
            current_word = current_word[:i] + current_word[i + 1:]
            removed += 1
            if current_word in word_list:
                found_words.append(current_word)
                break
        else:
            # Run is down to one letter with no match: restore the word and move past the run
            i += removed + 1
            removed = 0
            current_word = word
print(found_words)
# Expected, provided the word list contains each target word:
# ['why', 'hey', 'alright', 'cool', 'monday']
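A variation on the same idea, as a minimal sketch only: rather than removing one letter at a time, split the word into runs of identical letters and try every combination in which each run is kept at one or two letters, preferring the shortest match. The small known_words set is a stand-in for the downloaded word list, and the function name collapse_repeats is purely illustrative.

```python
from itertools import groupby, product

# Stand-in dictionary (an assumption for this sketch); in practice use the full word list.
known_words = {'why', 'hey', 'alright', 'cool', 'monday', 'eel'}

def collapse_repeats(word, vocabulary):
    """Return the shortest known word obtainable by shrinking each run of
    repeated letters to one or two letters, or None if there is no match."""
    word = word.lower()
    if word in vocabulary:
        return word
    # One entry per run: a single letter, plus a double letter for runs of 2 or more.
    options = []
    for letter, run in groupby(word):
        run_length = len(list(run))
        options.append([letter] if run_length == 1 else [letter, letter * 2])
    candidates = (''.join(parts) for parts in product(*options))
    matches = [c for c in candidates if c in vocabulary]
    return min(matches, key=len) if matches else None

words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday', 'cooool']
print([collapse_repeats(w, known_words) for w in words])
# ['why', 'hey', 'alright', 'cool', 'monday', 'cool']
```

Unlike the loop above, this copes with several stretched runs in one word (for example 'cooool' keeps its double o), at the cost of trying 2^k candidates for a word with k repeated runs.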
Upvotes: 1
Reputation: 1017
Well, a crude way:
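words = ['whyyyyyy', 'heyyyy', 'alrighttttt', 'cool', 'mmmmonday']
res = []
for word in words:
    # Strip repeated letters from the end of the word
    while len(word) > 1 and word[-2] == word[-1]:
        word = word[:-1]
    # Strip repeated letters from the start of the word
    while len(word) > 1 and word[0] == word[1]:
        word = word[1:]
    res.append(word)
print(res)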
Result:
['why', 'hey', 'alright', 'cool', 'monday']
Upvotes: 0
Reputation: 4449
If it's only repeated letters you want to strip, then the regular expression module re might help:
>>> import re
>>> re.sub(r'(.)\1+$', r'\1', 'cool')
'cool'
>>> re.sub(r'(.)\1+$', r'\1', 'coolllll')
'cool'
(It leaves 'cool' untouched.)
For leading repeated characters the correct substitution would be:
>>> re.sub(r'^(.)\1+', r'\1', 'mmmmonday')
'monday'
Of course this fails for words that legitimately start or end with repeated letters ...
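To make that limitation concrete, here is a small sketch that applies both substitutions in one helper (the name strip_edge_repeats is made up for this example) and shows legitimate words being damaged:

```python
import re

def strip_edge_repeats(word):
    """Collapse repeated characters at the start and at the end of a word."""
    word = re.sub(r'^(.)\1+', r'\1', word)   # leading run -> single character
    return re.sub(r'(.)\1+$', r'\1', word)   # trailing run -> single character

print(strip_edge_repeats('mmmmonday'))   # monday
print(strip_edge_repeats('whyyyyyy'))    # why
print(strip_edge_repeats('eel'))         # el  (legitimate double letter lost)
print(strip_edge_repeats('add'))         # ad  (same problem at the end)
```

Checking the result against a word list before accepting it would be one way to guard against these cases.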
Upvotes: 2