Reputation: 21
I want to search for specific words like "earnings" or "income". Therefore, I created a wordlist and searched for the words in the text.
However, my code returns no results for words followed by a punctuation mark, like "earnings." or "income,". Now, I want to remove these punctuation marks without removing the dot in a number like "2.4" or other marks like "%".
I already tried
table = str.maketrans({key: None for key in string.punctuation})
text_wo_dots = text.translate(table)
and
text_wo_dots = re.sub(r'[^\w\s]',' ',text)
but this removed all punctuation.
Upvotes: 2
Views: 132
Reputation: 163297
You could make use of a negative lookahead (?! and a negative lookbehind (?<! to assert that what is directly on the left and what is directly on the right is not a digit:
(?<!\d)[^\w\s]+(?!\d)
For example:
import re
text = "income,and 4.6 test"
text_wo_dots = re.sub(r'(?<!\d)[^\w\s]+(?!\d)',' ',text)
print(text_wo_dots) # income and 4.6 test
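To tie this back to the wordlist search in the question, here is a minimal sketch that cleans the text with the same pattern and then looks up the words. The helper name find_keywords, the sample sentence, and the wordlist are assumptions made for illustration; they are not from the original post.

```python
import re

# Pattern from the answer above: punctuation that is not adjacent to a digit.
PUNCT_NOT_NEAR_DIGIT = re.compile(r'(?<!\d)[^\w\s]+(?!\d)')

def find_keywords(text, wordlist):
    """Return the words from wordlist that occur in text (hypothetical helper)."""
    cleaned = PUNCT_NOT_NEAR_DIGIT.sub(' ', text)
    tokens = set(cleaned.lower().split())
    return [word for word in wordlist if word in tokens]

# Made-up sample text for illustration only.
sample = 'The firm reported higher earnings. Net income, however, rose only 2.4%.'
cleaned_sample = PUNCT_NOT_NEAR_DIGIT.sub(' ', sample)
matches = find_keywords(sample, ['earnings', 'income', 'revenue'])

print(cleaned_sample.split())
print(matches)
```

Lower-casing both sides is a design choice of this sketch; drop it if the search should be case-sensitive.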
Upvotes: 0
Reputation: 1431
Something as simple as this might also work:
[\.,:!?][\n\s]
[\.,:!?] matches one of the listed punctuation marks (you can add more if needed), while [\n\s] means that it must be followed by a whitespace character such as a space or a newline.
Here is a working example: https://regex101.com/r/TcR6Ct/2
Below is the Python code:
import re
s = 'Bla, bla, bla 7.6 bla.'
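pattern = r'[\.,:!?][\n\s]'  # raw string avoids invalid-escape warnings
# Append a space so punctuation at the very end matches too,
# and substitute a space so neighbouring words are not glued together.
s = re.sub(pattern, ' ', s + ' ').strip()
print(s)  # Bla bla bla 7.6 bla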
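The same idea can also be expressed with a lookahead, so that the whitespace after the punctuation mark is checked but not consumed and no extra space has to be appended. This is only a sketch of a variant, not the pattern from the regex101 link; the name TRAILING_PUNCT is an assumption.

```python
import re

# Lookahead variant: the punctuation mark must be followed by whitespace
# or by the end of the string, but that whitespace is left in place.
TRAILING_PUNCT = re.compile(r'[.,:!?](?=\s|$)')

s = 'Bla, bla, bla 7.6 bla.'
result = TRAILING_PUNCT.sub('', s)
print(result)  # Bla bla bla 7.6 bla
```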
Upvotes: 0
Reputation: 395
I suggest you first split your text into separate words, including the punctuation marks:
text = "This is an example, it contains 1.0 number and some words."
raw_list = text.split()
Now you can remove the punctuation marks that are at the end of an element.
cleaned_words = []
for word in raw_list:
    if word[-1] in ['.', ',', '!', '?']:
        cleaned_words.append(word[:-1])
    else:
        cleaned_words.append(word)
Note 1: If your text contains numbers written like 1. for 1.0, you also need to take the second-to-last character into account and leave the point in if isdigit() evaluates to True.
Note 2: If there are sentences that end with multiple punctuation marks, you should run a while loop to remove them and append the word only once no more punctuation marks are found.
cleaned_words = []
for word in raw_list:
    # Strip trailing punctuation; "word and" guards against an empty string.
    while word and word[-1] in ['.', ',', '!', '?']:
        word = word[:-1]
    cleaned_words.append(word)
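Notes 1 and 2 can be combined into one helper. The following is a rough sketch under the assumption that a dot directly after a digit should always stay, as Note 1 describes; the function name strip_trailing_punctuation and the sample sentence are made up for illustration.

```python
PUNCTUATION = ('.', ',', '!', '?')

def strip_trailing_punctuation(word):
    """Remove trailing punctuation, but keep a dot that follows a digit (hypothetical helper)."""
    while word and word[-1] in PUNCTUATION:
        # Note 1: leave the point in if the character before it is a digit.
        if word[-1] == '.' and len(word) > 1 and word[-2].isdigit():
            break
        # Note 2: keep stripping until no punctuation mark is left at the end.
        word = word[:-1]
    return word

text = "This is an example, it contains 1. number and some words!!"
cleaned_words = [strip_trailing_punctuation(w) for w in text.split()]
print(cleaned_words)
```

One consequence of this assumption is that a number at the very end of a sentence, such as "2.4.", keeps its final dot; whether that is acceptable depends on the text being searched.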
Upvotes: 1