replace punctuation with space in text

Question

I have text like this:

Cat In A Tea Cup by New Yorker cover artist Gurbuz Dogan Eksioglu,Handsome cello wrapped hard magnet, Ideal for home or office.

I removed the punctuation from this text with the following code:

import string
string.punctuation
def remove_punctuation(text):
    punctuationfree="".join([i for i in text if i not in string.punctuation])
    return punctuationfree
# storing the punctuation-free text
df_Train['BULLET_POINTS']= df_Train['BULLET_POINTS'].apply(lambda x:remove_punctuation(x))
df_Train.head()

In the code above, df_Train is a pandas DataFrame whose "BULLET_POINTS" column contains text like the sample above. The result I got is:

Cat In A Tea Cup by New Yorker cover artist Gurbuz Dogan EksiogluHandsome cello wrapped hard magnet Ideal for home or office

Notice how the two words Eksioglu and Handsome are joined because there is no space after the comma. I need a way to overcome this issue.

Wiktor Stribiżew · Accepted Answer

In this case, it makes sense to replace each run of special chars with a space and then strip the result (the second variant below also shrinks multiple spaces to a single space):

df['BULLET_POINTS'] = df['BULLET_POINTS'].str.replace(r'(?:[^\w\s]|_)+', ' ', regex=True).str.strip()

Or, if you have chunks of punctuation + whitespace to handle:

df['BULLET_POINTS'].str.replace(r'[\W_]+', ' ', regex=True).str.strip()

Output:

>>> df['BULLET_POINTS'].str.replace(r'(?:[^\w\s]|_)+', ' ', regex=True).str.strip()
0    Cat In A Tea Cup by New Yorker cover artist Gurbuz Dogan Eksioglu Handsome cello wrapped hard magnet  Ideal for home or office
Name: BULLET_POINTS, dtype: object

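The (?:[^\w\s]|_)+ regex matches one or more consecutive chars that are either underscores or neither word nor whitespace chars (i.e. a run of punctuation or symbol chars), and replaces each such run with a single space. Existing whitespace is left untouched, which is why the output above still has two spaces after "magnet".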

The [\W_]+ pattern is similar but also matches whitespace, so a chunk such as a comma followed by a space is replaced with a single space.
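To make the difference between the two patterns concrete, here is a minimal sketch that applies them to the sample sentence from the question with the standard-library re module instead of pandas. The variable names are illustrative only; the patterns are the same ones passed to .str.replace(..., regex=True) above.

```python
import re

text = ("Cat In A Tea Cup by New Yorker cover artist Gurbuz Dogan Eksioglu,"
        "Handsome cello wrapped hard magnet, Ideal for home or office.")

# Pattern 1: only runs of punctuation/underscores are replaced;
# whitespace that was already in the text is left alone.
keep_ws = re.sub(r'(?:[^\w\s]|_)+', ' ', text)

# Pattern 2: runs of punctuation, underscores AND whitespace
# are replaced together by a single space.
collapse_ws = re.sub(r'[\W_]+', ' ', text)

print(repr(keep_ws))
print(repr(collapse_ws))
```

With the first pattern, the ", " after "magnet" becomes two spaces (one for the comma plus the original space); with the second, the whole run becomes one space. Neither result has been stripped yet, so both still end with a trailing space left by the final period.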

The .str.strip() part is necessary as the replacement might result in leading/trailing spaces.
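If you would rather keep the apply-based shape of the code in the question, the same idea can be sketched as a plain function. The name replace_punctuation is hypothetical (chosen to mirror remove_punctuation from the question), and it uses the [\W_]+ variant so that mixed punctuation and whitespace collapse to one space.

```python
import re

# Precompiled once so the pattern is not re-parsed on every call.
_PUNCT_OR_SPACE = re.compile(r'[\W_]+')

def replace_punctuation(text):
    """Replace each run of punctuation/whitespace with one space, then trim the ends."""
    return _PUNCT_OR_SPACE.sub(' ', text).strip()

print(replace_punctuation(
    "Eksioglu,Handsome cello wrapped hard magnet, Ideal for home or office."
))
# prints: Eksioglu Handsome cello wrapped hard magnet Ideal for home or office
```

It could then be applied as df_Train['BULLET_POINTS'].apply(replace_punctuation), although the vectorized .str.replace call shown earlier is usually the more idiomatic choice in pandas.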
