Reputation: 65
I have text like this: Cat In A Tea Cup by New Yorker cover artist Gurbuz Dogan Eksioglu,Handsome cello wrapped hard magnet, Ideal for home or office.
I removed the punctuation from this text with the following code:
import string
string.punctuation
def remove_punctuation(text):
    punctuationfree = "".join([i for i in text if i not in string.punctuation])
    return punctuationfree
# storing the punctuation-free text
df_Train['BULLET_POINTS']= df_Train['BULLET_POINTS'].apply(lambda x:remove_punctuation(x))
df_Train.head()
In the code above, df_Train is a pandas DataFrame whose "BULLET_POINTS" column contains text like the sample shown earlier.
The result I got is: Cat In A Tea Cup by New Yorker cover artist Gurbuz Dogan EksiogluHandsome cello wrapped hard magnet Ideal for home or office
Notice how the two words Eksioglu and Handsome are joined together because there is no space after the comma. I need a way to avoid this.
Upvotes: 1
Views: 1739
Reputation: 627292
In this case, it makes sense to replace every run of special characters with a single space and then strip leading and trailing whitespace from the result:
df['BULLET_POINTS'] = df['BULLET_POINTS'].str.replace(r'(?:[^\w\s]|_)+', ' ', regex=True).str.strip()
Or, if you have chunks of punctuation + whitespace to handle:
df['BULLET_POINTS'].str.replace(r'[\W_]+', ' ', regex=True).str.strip()
Output:
>>> df['BULLET_POINTS'].str.replace(r'(?:[^\w\s]|_)+', ' ', regex=True).str.strip()
0 Cat In A Tea Cup by New Yorker cover artist Gurbuz Dogan Eksioglu Handsome cello wrapped hard magnet Ideal for home or office
Name: BULLET_POINTS, dtype: object
The (?:[^\w\s]|_)+ regex matches one or more consecutive characters that are each either an underscore or neither a word character nor whitespace (i.e. a run of punctuation and symbols), and each such run is replaced with a single space.
The [\W_]+ pattern is similar but also matches whitespace, so a mixed run such as a comma followed by a space is replaced with one space.
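To make the difference between the two patterns concrete, here is a small sketch that uses the standard `re` module on the question's sample text instead of a DataFrame (the variable names are made up for illustration). The same patterns behave the same way inside `.str.replace(..., regex=True)`.

```python
import re

# Sample text from the question.
text = ("Cat In A Tea Cup by New Yorker cover artist Gurbuz Dogan Eksioglu,"
        "Handsome cello wrapped hard magnet, Ideal for home or office.")

# Pattern 1: runs of punctuation/underscores only; existing whitespace is kept.
punct_only = re.sub(r'(?:[^\w\s]|_)+', ' ', text).strip()

# Pattern 2: runs of punctuation, underscores and whitespace together.
punct_and_space = re.sub(r'[\W_]+', ' ', text).strip()

print(repr(punct_only))       # "magnet, Ideal" becomes "magnet  Ideal" (two spaces)
print(repr(punct_and_space))  # every gap is exactly one space
```

Both versions separate Eksioglu and Handsome. With the first pattern, the space that already followed a comma stays in place next to the inserted one, which is why the second pattern is the better fit when punctuation is often followed by whitespace.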
The .str.strip() part is needed because the replacement can leave leading or trailing spaces.
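If you would rather keep a plain function like the one in the question, a minimal sketch could look like this. The sample string is hypothetical and chosen so that the replacement leaves a space at both ends, and the function simply reuses the question's name `remove_punctuation`.

```python
import re

def remove_punctuation(text: str) -> str:
    """Replace each run of non-word chars/underscores with one space, then trim."""
    spaced = re.sub(r'[\W_]+', ' ', text)
    return spaced.strip()

# Hypothetical input that starts and ends with punctuation.
sample = "(New) Ideal for home or office."

unstripped = re.sub(r'[\W_]+', ' ', sample)
print(repr(unstripped))                  # ' New Ideal for home or office '
print(repr(remove_punctuation(sample)))  # 'New Ideal for home or office'
```

Assuming the DataFrame from the question, this could be applied with df_Train['BULLET_POINTS'].apply(remove_punctuation), although the vectorized .str.replace call above does the same job without a Python-level function.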
Upvotes: 3