Reputation: 2304
I have the following dataset:
import pandas as pd
a_df = pd.DataFrame({'id':[1,2,3,4,5],'text':['This was fuuuuun','aaaawesome','Hiiigh altitude','Oops','See you']})
a_df
   id              text
0   1  This was fuuuuun
1   2        aaaawesome
2   3   Hiiigh altitude
3   4              Oops
4   5           See you
Some words are misspelled. One rule I want to apply is that if I see three or more repeated vowels or consonants in a row, I can be fairly sure the word is misspelled, so I replace that repetition with ''.
So I have tried this:
a_df['corrected_text'] = a_df['text'].str.replace(r'([a-zA-Z])\\3+','')
But there is no change. My logic was to try to capture letters that were repeated, but I must be doing something wrong. Please, any help will be greatly appreciated.
Upvotes: 4
Views: 184
Reputation: 627077
You can use
a_df['text'] = a_df['text'].str.replace(r'([a-zA-Z])\1{2,}', r'\1', regex=True)
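To see why the attempt in the question left the text unchanged, here is a minimal sketch that uses Python's built-in re module directly instead of pandas. It assumes that pandas applies the same re semantics to each element when regex=True is passed, which should hold for ordinary object-dtype string columns. The variable names (texts, attempt, fixed) are only for illustration. Note also that in pandas 2.0 and later, str.replace defaults to regex=False, so the pattern in the question would be treated as a literal string unless regex=True is given.

```python
import re

# Sample strings copied from the question's 'text' column.
texts = ['This was fuuuuun', 'aaaawesome', 'Hiiigh altitude', 'Oops', 'See you']

# Pattern from the question: inside a raw string, \\3+ means an escaped
# backslash followed by one or more literal '3' characters. It is not a
# backreference, so it never matches ordinary repeated letters.
attempt = re.compile(r'([a-zA-Z])\\3+')

# Pattern from this answer: \1 refers back to whatever group 1 captured,
# so the whole pattern matches the same letter three or more times in a row.
fixed = re.compile(r'([a-zA-Z])\1{2,}')

attempt_result = [attempt.sub('', t) for t in texts]
fixed_result = [fixed.sub(r'\1', t) for t in texts]

print(attempt_result)  # identical to the input list
print(fixed_result)    # each run of a repeated letter collapsed to one letter
```

Replacing with r'\1' keeps one copy of the repeated letter. Replacing with '' instead, as described in the question, would remove the whole run and turn 'fuuuuun' into 'fn'.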
Details:
([a-zA-Z]) - capturing group with ID 1
\1{2,} - two or more occurrences (so, three or more letters together with the previous pattern) of the Group 1 value
In the replacement, \1 is a replacement backreference to the Group 1 value. Make sure to use it in a raw string literal, else you would have to double the backslashes.
Upvotes: 3
Reputation: 26676
Let's try replacing every three consecutive vowels with '' as follows:
a_df['text'].str.replace('([aeiou]{3})','', regex=True)
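As a rough check of what this pattern does, here is a sketch using the built-in re module, again assuming pandas applies the same semantics per element with regex=True. The last entry in texts, 'beautiful', is a hypothetical example that is not in the original data. It is included because [aeiou]{3} matches any three lowercase vowels in a row, not only the same vowel repeated.

```python
import re

# Same pattern as in the line above.
vowel_run = re.compile(r'([aeiou]{3})')

# Strings from the question, plus one hypothetical correctly spelled word.
texts = ['This was fuuuuun', 'aaaawesome', 'Hiiigh altitude', 'Oops', 'See you',
         'beautiful']

# Each non-overlapping block of exactly three vowels is removed;
# leftover vowels from a longer run stay in place.
vowel_result = [vowel_run.sub('', t) for t in texts]

for before, after in zip(texts, vowel_result):
    print(f'{before!r} -> {after!r}')
```

On this data, 'fuuuuun' keeps two of its five u characters, 'Hiiigh' loses all of its i characters, and 'beautiful' loses 'eau', so the result may need a further pass depending on the goal.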
Upvotes: 1