Reputation: 1122
I have dataset which contains comments of people in Persian and Arabic. Some comments contain words like عاااالی
which is not a real word and the right word is actually عالی
. It's like using woooooooow!
instead of WoW!
.
My intention is to find these words and remove all extra alphabets. the only refrence I found is the code below which removes the words with repeated alphabets:
import re
p = re.compile(r'\s*\b(?=[a-z\d]*([a-z\d])\1{3}|\d+\b)[a-z\d]+', re.IGNORECASE)
s = "df\nAll aaaaaab the best 8965\nUS issssss is 123 good \nqqqq qwerty 1 poiks\nlkjh ggggqwe 1234 aqwe iphone5224s"
strs = s.split("\n")
print([p.sub("", x).strip() for x in strs])
I just need to replace the word with the one that has removed the extra repeated alphabets. you can use this sentence as a test case:
سلاااااام چطووووورین؟ من خیلی گشتم ولی مثل این کیفیت اصلاااااا ندیدممممم.
It has to be like this:
سلام چطورین؟ من خیلی گشتم ولی مثل این کیفیت اصلا ندیدم
please consider that more than 3 repeats are not acceptable.
Upvotes: 1
Views: 43
Reputation: 627327
You may use
re.sub(r'([^\W\d_])\1{2,}', r'\1', s)
It will replace chunks of identical consecutive letters with their single occurrence.
See the regex demo.
Details
([^\W\d_])
- Capturing group 1: any Unicode letter\1{2,}
- two or more repetitions of the same letter that is captured in Group 1.The r'\1'
replacement will only keep a single letter occurrence in the result.
Upvotes: 1