Alexis

Reputation: 2304

Repeated vowels and consonants in words in pandas

I have the following dataset:

a_df = pd.DataFrame({'id':[1,2,3,4,5],'text':['This was fuuuuun','aaaawesome','Hiiigh altitude','Oops','See you']})

a_df
    id  text
0   1   This was fuuuuun
1   2   aaaawesome
2   3   Hiiigh altitude
3   4   Oops
4   5   See you

Some words are misspelled. One rule I want to apply: if the same vowel or consonant appears three or more times in a row, I can be fairly sure the word is misspelled, so I replace that repetition with ''.

So I have tried this:

a_df['corrected_text'] = a_df['text'].str.replace(r'([a-zA-Z])\\3+','')

But there is no change. My logic was to try to capture letters that were repeated, but I must be doing something wrong. Please, any help will be greatly appreciated.

Upvotes: 4

Views: 184

Answers (2)

Wiktor Stribiżew

Reputation: 627077

You can use

a_df['text'] = a_df['text'].str.replace(r'([a-zA-Z])\1{2,}', r'\1', regex=True)

Details:

  • ([a-zA-Z]) - capturing group with ID 1
  • \1{2,} - two or more occurrences of the Group 1 value, so three or more identical letters together with the preceding group (\1 is a backreference to the Group 1 value, both in the pattern and in the replacement; make sure to use it in a raw string literal, else you would have to double the backslashes).
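
For reference, here is a minimal sketch of that replacement run against the sample data from the question. It writes to the corrected_text column from the question's own attempt rather than overwriting text; that choice is only an assumption about what is wanted.

```python
import pandas as pd

a_df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'text': ['This was fuuuuun', 'aaaawesome', 'Hiiigh altitude', 'Oops', 'See you'],
})

# Collapse any run of three or more identical letters to a single letter.
# The match is case-sensitive, so 'Oo' in 'Oops' is not treated as a repeat.
a_df['corrected_text'] = a_df['text'].str.replace(
    r'([a-zA-Z])\1{2,}', r'\1', regex=True
)

print(a_df['corrected_text'].tolist())
# ['This was fun', 'awesome', 'High altitude', 'Oops', 'See you']
```

As for the original attempt: inside a raw string, \\3 is a regex for a literal backslash followed by the digit 3, not a backreference, so the pattern never matches this text.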

Upvotes: 3

wwnde

Reputation: 26676

Let's try replacing every run of three consecutive vowels with '' as follows:

a_df['text'].str.replace('([aeiou]{3})','', regex=True)
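
Note that [aeiou]{3} matches any three lowercase vowels in a row, not necessarily the same vowel, and the whole match is removed. The sketch below compares it with a backreference pattern using the standard-library re module. The word 'beautiful' is a made-up extra sample, not part of the question's data.

```python
import re

# 'beautiful' is a hypothetical extra sample added for illustration.
samples = ['This was fuuuuun', 'aaaawesome', 'Hiiigh altitude', 'beautiful']

# Removes every block of three consecutive vowels, whatever they are.
any_three_vowels = [re.sub(r'[aeiou]{3}', '', s) for s in samples]

# Collapses three or more repeats of the same letter to one letter.
same_letter_run = [re.sub(r'([a-zA-Z])\1{2,}', r'\1', s) for s in samples]

print(any_three_vowels)
# ['This was fuun', 'awesome', 'Hgh altitude', 'btiful']
print(same_letter_run)
# ['This was fun', 'awesome', 'High altitude', 'beautiful']
```

Which behaviour is preferable depends on whether legitimate vowel clusters such as 'eau' should be left alone.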

Upvotes: 1
