Reputation: 567
I am trying to do some regular expression pattern matching for my aspect-based sentiment analysis. I cannot keep the punctuation in the right position after the pattern match.
def extra_expression1(txt):
    txt= str(txt)
    nlp=spacy.load("en_core_web_sm")
    txt=nlp(txt)
    punc=''
    a=len(txt)
    for token in txt:
        if (token.is_punct==False):
            txt=str(txt)
            txt=re.sub('goo+d+[^a-z]','good',txt) #"goooodddd" to "good"
            a=a-1
        else:
            punc=token.text
            if (a!=0):
                txt=str(txt) + str(punc)
                punc=''
            else:
                txt=str(txt) + str(punc)
            a=a-1
    return txt
and
txt1=["hotel is goood! breakfast was bad."]
df_22=pd.DataFrame(
    {
        'clean_review' : txt1
    }
)
display(df_22)
for i,txt in enumerate(df_22['clean_review']):
    txt1= extra_expression1(txt)
    df_22['clean_review'].iloc[i]=txt1
df_22
The output is (the last one is after the processing):
How can I solve this?
Upvotes: 2
Views: 113
Reputation: 12711
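I can't run your code, but the character following "good" shouldn't be part of the match. With [^a-z] inside the pattern, that character is consumed and replaced along with the word, which is why the punctuation disappears. Try using a lookahead instead: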
txt=re.sub('goo+d+(?=[^a-z])','good',txt) #"goooodddd" to "good"
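To make the difference concrete, here is a minimal sketch that runs both patterns on the sample sentence from the question, using only the standard re module (no spaCy or pandas). The variable names are made up for illustration. The last variant, with a negative lookahead, is my own addition and not part of the fix above; it is only there in case the stretched word can also appear at the very end of the string, where (?=[^a-z]) has no character to look at.

```python
import re

text = "hotel is goood! breakfast was bad."

# Original pattern: [^a-z] is part of the match, so the "!" is
# replaced together with the stretched word.
consumed = re.sub(r'goo+d+[^a-z]', 'good', text)

# Lookahead pattern: the next character is checked but not consumed,
# so the punctuation stays where it was.
kept = re.sub(r'goo+d+(?=[^a-z])', 'good', text)

# Assumed variant: a negative lookahead also matches when the word
# is the last thing in the string.
at_end = re.sub(r'goo+d+(?![a-z])', 'good', "breakfast was goood")

print(consumed)  # hotel is good breakfast was bad.
print(kept)      # hotel is good! breakfast was bad.
print(at_end)    # breakfast was good
```

Since the lookahead leaves the punctuation in place, the token loop that re-appends punctuation may no longer be needed for this particular replacement.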
Upvotes: 2