How can I use regex to search Unicode text and find words that contain repeated letters?

Question

I have a dataset that contains people's comments in Persian and Arabic. Some comments contain words like عاااالی, which is not a real word; the correct word is عالی. It is like writing woooooooow! instead of wow!.

My goal is to find these words and remove all the extra letters. The only reference I found is the code below, which removes the words that contain repeated letters:

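import re

# Matches a whole word that contains a character repeated 4+ times, or a word made only of digits
p = re.compile(r'\s*\b(?=[a-z\d]*([a-z\d])\1{3}|\d+\b)[a-z\d]+', re.IGNORECASE)
s = """df
All aaaaaab the best 8965
US issssss is 123 good 
qqqq qwerty 1 poiks
lkjh ggggqwe 1234 aqwe iphone5224s"""
strs = s.split("\n")
# Remove the matching words from every line
print([p.sub("", x).strip() for x in strs])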

I just need to replace each such word with a version that has the extra repeated letters removed. You can use this sentence as a test case:

سلاااااام چطووووورین؟ من خیلی گشتم ولی مثل این کیفیت اصلاااااا ندیدممممم.

It has to be like this:

سلام چطورین؟ من خیلی گشتم ولی مثل این کیفیت اصلا ندیدم

Please note that more than 3 repetitions of a letter are not acceptable.

Wiktor Stribiżew · Accepted Answer

You may use

re.sub(r'([^\W\d_])\1{2,}', r'\1', s)

It will replace each run of three or more identical consecutive letters with a single occurrence of that letter.
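As a minimal sketch, the substitution can be tried on the test sentence from the question. Python 3 is assumed here, where `str` patterns are Unicode-aware by default; the names `REPEATED_LETTER` and `collapse_repeats` are illustrative helpers, not part of the `re` module.

```python
import re

# Hypothetical helper names, chosen for illustration only.
REPEATED_LETTER = re.compile(r'([^\W\d_])\1{2,}')

def collapse_repeats(text):
    """Replace each run of 3+ identical letters with one copy of that letter."""
    return REPEATED_LETTER.sub(r'\1', text)

# Test sentence taken from the question
s = "سلاااااام چطووووورین؟ من خیلی گشتم ولی مثل این کیفیت اصلاااااا ندیدممممم."
print(collapse_repeats(s))
```

The pattern only matches letters, so the question mark and the final period of the sentence are left in place.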

See the regex demo.

Details

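([^\W\d_]) - Capturing group 1: any Unicode letter (a word character that is neither a digit nor an underscore)
\1{2,} - two or more further repetitions of the same letter that is captured in Group 1, so the whole match is a run of at least three identical letters.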

The r'\1' replacement will only keep a single letter occurrence in the result.
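To illustrate what the character class includes and excludes, here is a short sketch that runs the same pattern over a few made-up ASCII samples (the sample strings are assumptions for demonstration, not data from the question):

```python
import re

pattern = re.compile(r'([^\W\d_])\1{2,}')

samples = [
    "woooooooow",   # letters: the run is collapsed
    "1000000",      # digits are excluded by \d
    "a____b",       # underscores are excluded by _
    "wait!!!!",     # punctuation is excluded by \W
    "good",         # a run of exactly two letters is too short to match
]
results = [pattern.sub(r'\1', sample) for sample in samples]
for sample, result in zip(samples, results):
    print(sample, '->', result)
```

Only the first sample changes (to `wow`); the others come back unchanged, which shows that numbers, separators and ordinary double letters are not affected.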
