Reputation: 333
I am using the following "fastest" way of removing punctuation from a string:
text = file_open.translate(str.maketrans("", "", string.punctuation))
However, it removes all punctuation, including the apostrophes in tokens such as shouldn't, turning it into shouldnt.
The problem is that I am using the NLTK library for stopwords, and the standard stopword list does not include such forms without apostrophes. Instead, it contains the tokens that the NLTK tokenizer would generate if I used it to split my text. For example, for shouldnt the stopwords included are shouldn, shouldn't, t.
I can either add the additional stopwords or remove the apostrophes from the NLTK stopwords. But neither solution seems "correct" to me, as I think the apostrophes should be left in place when cleaning punctuation.
Is there a way I can leave the apostrophes when doing fast punctuation cleaning?
Upvotes: 5
Views: 5483
Reputation: 3908
Edited from this answer.
import re
s = "This is a test string, with punctuation. This shouldn't fail...!"
text = re.sub(r'[^\w\d\s\']+', '', s)
print(text)
This returns:
This is a test string with punctuation This shouldn't fail
Regex explanation:
[^...] matches any single character that is not listed inside the brackets
\w matches any word character (equal to [a-zA-Z0-9_] for ASCII text; on Python 3 strings it also matches Unicode letters and digits)
\d matches a digit (equal to [0-9] for ASCII text; already covered by \w, so it is redundant here)
\s matches any whitespace character (equal to [\r\n\t\f\v ] for ASCII text)
\' matches the character ' literally
+ matches between one and unlimited times, as many times as possible, giving back as needed
And you can try it here.
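If the substitution runs over many strings, a precompiled pattern may be a little tidier. The following is a minimal sketch of the same idea, not part of the original answer: the names `PUNCT_RE` and `strip_punctuation` are made up for illustration, and `\d` is left out on the assumption that `\w` already covers digits.

```python
import re

# Anything that is not a word character, whitespace, or an apostrophe.
# Note that \w includes the underscore, so "_" is kept as well.
PUNCT_RE = re.compile(r"[^\w\s']+")

def strip_punctuation(text):
    """Delete punctuation runs but leave apostrophes in place."""
    return PUNCT_RE.sub("", text)

print(strip_punctuation("This shouldn't fail...! snake_case stays"))
```

Because underscores count as word characters, they survive this pattern; add `_` handling separately if that matters for your text.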
Upvotes: 3
Reputation: 14233
>>> from string import punctuation
>>> type(punctuation)
<class 'str'>
>>> my_punctuation = punctuation.replace("'", "")
>>> my_punctuation
'!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'
>>> "It's right, isn't it?".translate(str.maketrans("", "", my_punctuation))
"It's right isn't it"
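To show how this fits the stopword filtering described in the question, here is a rough sketch. The small `SAMPLE_STOPWORDS` set is a hypothetical stand-in so the snippet runs on its own; in practice you would presumably load the list with `nltk.corpus.stopwords.words("english")`. The function name `content_words` is also invented for this example.

```python
import string

# Build the table once: every punctuation mark except the apostrophe is deleted.
KEEP_APOSTROPHE = str.maketrans("", "", string.punctuation.replace("'", ""))

# Hypothetical stand-in for the NLTK English stopword list.
SAMPLE_STOPWORDS = {"it", "is", "a", "shouldn't", "isn't", "it's"}

def content_words(text):
    """Lowercase, strip punctuation (keeping apostrophes), drop stopwords."""
    cleaned = text.lower().translate(KEEP_APOSTROPHE)
    return [tok for tok in cleaned.split() if tok not in SAMPLE_STOPWORDS]

print(content_words("It's right, isn't it? This shouldn't fail."))
```

Since the apostrophes are kept, contractions such as shouldn't match the stopword entries directly, without editing the stopword list.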
Upvotes: 8
Reputation: 298
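What about using
text = file_open.translate(str.maketrans(",.", "  "))
and adding any other characters you want to ignore to the first string? This replaces each listed character with a space rather than deleting it. The two strings passed to str.maketrans must be the same length, so add one space to the second string for every character you add to the first.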
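Keeping two strings the same length by hand gets awkward as the list grows. One possible alternative, sketched here as an assumption rather than as what this answer proposes, is to pass `str.maketrans` a single dict that maps each unwanted character to a space. The names `SPACE_TABLE` and `punctuation_to_spaces` are hypothetical.

```python
import string

# Every punctuation mark except the apostrophe is mapped to a single space.
to_blank = string.punctuation.replace("'", "")
SPACE_TABLE = str.maketrans(dict.fromkeys(to_blank, " "))

def punctuation_to_spaces(text):
    """Replace punctuation with spaces, then collapse repeated whitespace."""
    return " ".join(text.translate(SPACE_TABLE).split())

print(punctuation_to_spaces("well-known,isn't it?"))
```

Replacing with a space instead of deleting keeps words apart when punctuation is the only thing separating them, as in "well-known,isn't".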
Upvotes: 1