asleniovas

Reputation: 333

How to strip punctuation from a string, except apostrophes, for NLP

I am using the following "fastest" way of removing punctuation from a string:

import string

text = file_open.translate(str.maketrans("", "", string.punctuation))

However, it removes all punctuation, including the apostrophes in tokens such as shouldn't, which becomes shouldnt.

The problem is that I am using the NLTK library for stopwords, and the standard stopword list does not include apostrophe-free forms such as shouldnt. Instead, it contains the tokens that NLTK would generate if I used the NLTK tokenizer to split my text. For example, for shouldn't the list includes shouldn, shouldn't, and t.

I could either add the apostrophe-free forms as additional stopwords or remove the apostrophes from the NLTK stopwords. Neither solution seems "correct" to me, though, as I think apostrophes should be kept during punctuation cleaning.

Is there a way to keep the apostrophes while still doing fast punctuation cleaning?

Upvotes: 5

Views: 5483

Answers (3)

funie200

Reputation: 3908

Edited from this answer.

import re

s = "This is a test string, with punctuation. This shouldn't fail...!"

text = re.sub(r'[^\w\d\s\']+', '', s)
print(text)

This returns:

This is a test string with punctuation This shouldn't fail

Regex explanation:

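[^...] matches any single character that is not listed inside the square brackets
\w matches any word character (equal to [a-zA-Z0-9_] in ASCII mode; for Python 3 str patterns it also matches Unicode letters and digits)
\d matches a digit (equal to [0-9] in ASCII mode; already covered by \w, so it is redundant here)
\s matches any whitespace character (equal to [ \t\n\r\f\v] in ASCII mode)
\' matches the character ' literally
+ matches one or more times, as many times as possible, giving back as needed (greedy)

Note that \w includes the underscore, so underscores are not removed by this pattern.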

And you can try it here.
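
As a follow-up to the regex explanation above, here is a minimal sketch of a precompiled variant that also strips underscores, which \w would otherwise keep. The names PUNCT_EXCEPT_APOSTROPHE and strip_punctuation are made up for illustration, and whether underscores should count as punctuation is an assumption about your data.

```python
import re

# Keep letters, digits, whitespace and apostrophes; drop everything else.
# \w already covers digits, so \d is not needed. The extra "_" alternative
# removes underscores, which \w would otherwise keep.
PUNCT_EXCEPT_APOSTROPHE = re.compile(r"[^\w\s']|_")


def strip_punctuation(text):
    """Return text with punctuation removed, apostrophes preserved."""
    return PUNCT_EXCEPT_APOSTROPHE.sub("", text)


sample = "This is a test_string, with punctuation. This shouldn't fail...!"
cleaned = strip_punctuation(sample)
print(cleaned)
```

Compiling the pattern once avoids re-parsing it when the function is called on many documents.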

Upvotes: 3

buran

Reputation: 14233

>>> from string import punctuation
>>> type(punctuation)
<class 'str'>
>>> my_punctuation = punctuation.replace("'", "")
>>> my_punctuation
'!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'
>>> "It's right, isn't it?".translate(str.maketrans("", "", my_punctuation))
"It's right isn't it"
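
The session above could be wrapped into a reusable helper along these lines. This is only a sketch: TABLE, STOPWORDS and clean_tokens are hypothetical names, and the tiny stopword set is a stand-in for a real list such as NLTK's, not a copy of it.

```python
import string

# Build the translation table once; every ASCII punctuation mark except
# the apostrophe is mapped to None (i.e. deleted).
TABLE = str.maketrans("", "", string.punctuation.replace("'", ""))

# Hypothetical mini stopword list, standing in for a real one.
STOPWORDS = {"it's", "isn't", "it", "shouldn't"}


def clean_tokens(text):
    """Lowercase, strip punctuation except apostrophes, drop stopwords."""
    tokens = text.lower().translate(TABLE).split()
    return [tok for tok in tokens if tok not in STOPWORDS]


tokens = clean_tokens("It's right, isn't it? You shouldn't worry.")
print(tokens)
```

Because contractions keep their apostrophes, they can be matched directly against stopword entries such as shouldn't. Note that string.punctuation only contains ASCII characters, so typographic apostrophes and quotes are left untouched by this table.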

Upvotes: 8

Znerual

Reputation: 298

What about using

text = file_open.translate(str.maketrans(",.", "  "))

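and adding the other characters you want to strip to the first string, with a matching space in the second string for each one (both arguments to str.maketrans must have the same length). Note that this replaces the characters with spaces rather than deleting them.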
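
A minimal sketch of this idea, extended to all ASCII punctuation except the apostrophe. The variable names and the sample string are made up for illustration.

```python
import string

# Map each punctuation mark except the apostrophe to a single space.
# Both arguments to str.maketrans must have the same length.
to_replace = string.punctuation.replace("'", "")
table = str.maketrans(to_replace, " " * len(to_replace))

raw = "well-known facts,shouldn't merge...words"
spaced = raw.translate(table)

# split() with no arguments collapses the runs of spaces left behind.
tokens = spaced.split()
print(tokens)
```

Replacing with spaces instead of deleting keeps words apart when punctuation is the only thing separating them, as with facts,shouldn't above. The cost is that hyphenated words are split in two.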

Upvotes: 1
