FLang

Reputation: 21

How can I remove ONLY punctuation marks like "." and ","?

I want to search for specific words like "earnings" or "income". Therefore, I created a word list and searched the text for those words.

However, my code returns no results for words followed by a punctuation mark, like "earnings." or "income,". I want to remove these punctuation marks without removing the dot in a number like "2.4" or other marks like "%".

I already tried

table = str.maketrans({key: None for key in string.punctuation})
text_wo_dots = text.translate(table)

and

text_wo_dots = re.sub(r'[^\w\s]',' ',text)

but this removed all punctuation.

Upvotes: 2

Views: 132

Answers (3)

The fourth bird

Reputation: 163297

You could use a negative lookbehind (?<!\d) and a negative lookahead (?!\d) to assert that what is directly to the left and directly to the right of the punctuation is not a digit:

(?<!\d)[^\w\s]+(?!\d)


For example:

import re
text = "income,and 4.6 test"
text_wo_dots = re.sub(r'(?<!\d)[^\w\s]+(?!\d)',' ',text)
print(text_wo_dots) # income and 4.6 test
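
If only "." and "," should go while marks like "%" stay, the same lookaround idea can be narrowed. The following is a minimal sketch, not part of the answer above: the pattern, the helper name strip_dots_and_commas, and the sample sentence are all assumptions made for illustration. It drops a dot or comma unless it sits between two digits, so "2.4" and "1,000.5" survive.

```python
import re

# Hypothetical pattern: match "." or "," unless it has a digit on BOTH sides.
# First alternative: not preceded by a digit. Second: not followed by a digit.
SEPARATOR = re.compile(r'(?<!\d)[.,]|[.,](?!\d)')

def strip_dots_and_commas(text):
    # Replace with a space so "income,net" does not collapse into one word,
    # then normalise the whitespace that the replacement leaves behind.
    spaced = SEPARATOR.sub(' ', text)
    return ' '.join(spaced.split())

cleaned = strip_dots_and_commas(
    "Earnings rose 2.4%, and income,net of tax, was 1,000.5."
)
print(cleaned)  # Earnings rose 2.4% and income net of tax was 1,000.5
```

Unlike the broader [^\w\s] class, this leaves "%" and every other symbol untouched.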

Upvotes: 0

Ildar Akhmetov

Reputation: 1431

Something as simple as this might also work:

[\.,:!?][\n\s]

[\.,:!?] lists the punctuation marks to remove (you can add more if needed), while [\n\s] means that the mark must be followed by a whitespace character such as a space or a newline. Note that \s already covers \n.

Here is a working example: https://regex101.com/r/TcR6Ct/2

Below is the Python code:

import re

s = 'Bla, bla, bla 7.6 bla.'

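# Raw string so the backslashes reach the regex engine unchanged
pattern = r'[\.,:!?][\n\s]'
# Pad with a space so a mark at the very end matches; replace with a space
# (not '') so neighbouring words do not get glued together
s = re.sub(pattern, ' ', s + ' ').strip()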
print(s)
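
A variant of the same idea, sketched here as an assumption rather than as the answer's own code: a lookahead checks for the following whitespace (or the end of the string) without consuming it, so no padding is needed and the spaces between words are left alone. The name TRAILING_MARK is hypothetical.

```python
import re

# One or more listed marks, only when followed by whitespace or end of string.
# The lookahead (?=...) checks the next character without removing it.
TRAILING_MARK = re.compile(r'[.,:!?]+(?=\s|$)')

sample = 'Bla, bla, bla 7.6 bla.'
result = TRAILING_MARK.sub('', sample)
print(result)  # Bla bla bla 7.6 bla
```

The dot in "7.6" is followed by a digit, so the lookahead fails there and the number is kept.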

Upvotes: 0

Lucas

Reputation: 395

I suggest you first split your text into separate words, with the punctuation marks still attached:

# text must be a string (not a list) for split() to work
text = "This is an example, it contains 1.0 number and some words."
raw_list = text.split()

Now you can remove the punctuation marks that are at the end of an element.

cleaned_words = []
for word in raw_list:
    if word[-1] in ['.', ',', '!', '?']:
        cleaned_words.append(word[:-1])
    else:
        cleaned_words.append(word)

Note 1: If your text contains numbers written like "1." for 1.0, you also need to check the second-to-last character and keep the dot if isdigit() evaluates to True.
Note 2: If there are sentences that end with multiple punctuation marks, run a while loop to remove them, and append the word only once no more punctuation marks are found.

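# Checking "word" first avoids an IndexError when a token consists
# only of punctuation (e.g. "...") and becomes empty
while word and word[-1] in ['.', ',', '!', '?']:
    word = word[:-1]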

cleaned_words.append(word)
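
Put together, the split-and-strip approach might look like the sketch below. It swaps the explicit loop for str.rstrip, which removes every trailing character from a given set in one call. The function name clean_words, the marks parameter, and the sample sentence are assumptions for illustration, and the "1." case from Note 1 is not handled here.

```python
def clean_words(text, marks=".,!?"):
    """Split text on whitespace and strip trailing punctuation from each word."""
    cleaned = []
    for word in text.split():
        # rstrip removes ALL trailing characters found in `marks`,
        # so "words!!" and "words." both become "words"
        stripped = word.rstrip(marks)
        if stripped:  # skip tokens that were punctuation only, e.g. "..."
            cleaned.append(stripped)
    return cleaned

words = clean_words("Earnings rose 2.4%, and income, net of tax, was 1.0!!")
wordlist = {"earnings", "income"}
hits = [w for w in words if w.lower() in wordlist]
print(hits)  # ['Earnings', 'income']
```

Because only trailing characters are stripped, the dot inside "2.4%" and "1.0" is never touched.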

Upvotes: 1
