Reputation: 145
I have a problem.
For example, I have a sentence:
s = "AAA? BBB. CCC!"
So, I do:
import string
table = str.maketrans('', '', string.punctuation)
s = s.translate(table)
That works fine. My new sentence is:
s = "AAA BBB CCC"
But if I have an input sentence like:
s = "AAA? BBB. CCC! DDD.EEE"
then after removing the punctuation with the same method as above, I get:
s = "AAA BBB CCC DDDEEE"
but I need:
s = "AAA BBB CCC DDD EEE"
Are there any ideas or methods for solving this problem?
Upvotes: 9
Views: 37262
Reputation: 948
I know not everyone has this situation, but I am writing an internationalized app, so it's a bit of a heavier lift. This is what I have come up with:
[Edited to add 'import regex'] - Thanks Andj
import regex  # third-party package (pip install regex), not the standard-library re
random_string = "~`!ќ®†њѓѕў‘“ъйжюёф №%:,)( ЛПМКё…∆≤≥“™ƒђ≈≠»"
clean_string = regex.sub( r'[^\w\s]', '', random_string )
print( clean_string )
Result is:
ќњѓѕўъйжюёф ЛПМКёƒђ
This works with a wide range of alphabets and special characters in many languages. I've tested it on several languages, using every special character and a few regular characters on each keyboard. I still need to strip out a few special markers that this won't detect.
Straightforward but powerful. Hope that helps someone.
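If the third-party regex module is not an option, one rough standard-library sketch is to classify characters by Unicode category instead. It assumes that "punctuation" means any character whose category starts with P (punctuation) or S (symbol), which may be broader or narrower than you need. The function name strip_unicode_punctuation is made up for illustration, and unlike the snippet above it replaces each match with a space instead of deleting it:

```python
import unicodedata

def strip_unicode_punctuation(text):
    # Assumption: categories starting with "P" (punctuation) or "S" (symbol)
    # are the characters to drop; letters, digits and marks are kept.
    spaced = "".join(
        " " if unicodedata.category(ch)[0] in "PS" else ch
        for ch in text
    )
    # split() with no arguments collapses any run of whitespace.
    return " ".join(spaced.split())

print(strip_unicode_punctuation("AAA? BBB. CCC! DDD.EEE"))  # AAA BBB CCC DDD EEE
```

Because it replaces rather than deletes, words joined only by punctuation stay separate.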
Upvotes: 0
Reputation: 89527
string.punctuation contains the following characters:
'!"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~'
You can use the translate and maketrans functions to map punctuation characters to nothing (that is, delete them):
import string
'AAA? BBB. CCC! DDD.EEE'.translate(str.maketrans('', '', string.punctuation))
Output:
'AAA BBB CCC DDDEEE'
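Note that this output still glues DDD and EEE together, because every punctuation character is deleted outright. A minimal sketch of a variant that keeps the words apart is to map each punctuation character to a space and then collapse the extra whitespace. The helper name punctuation_to_spaces is hypothetical, and it only covers the ASCII characters in string.punctuation:

```python
import string

def punctuation_to_spaces(text):
    # Map each ASCII punctuation character to a space instead of deleting it.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    # split()/join collapses the doubled spaces left behind (e.g. "? " -> "  ").
    return " ".join(text.translate(table).split())

print(punctuation_to_spaces("AAA? BBB. CCC! DDD.EEE"))  # AAA BBB CCC DDD EEE
```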
Upvotes: 8
Reputation: 1
Try this:
import string
exclude = set(string.punctuation)
exclude.remove(".")
doc = "AAA? BBB. CCC! DDD.EEE"
for punctuation in exclude:
    doc = doc.replace(punctuation, "")
doc = doc.replace("."," ")
doc = doc.split()
print(" ".join(doc))
Upvotes: 0
Reputation: 12202
Use:
import re
" ".join(re.split(r'\W+', s))
That splits the string on all non-word characters, then joins the individual substrings by single spaces.
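One caveat: when the string starts or ends with punctuation, re.split returns an empty string at that end, so the joined result picks up a stray leading or trailing space. A small sketch that avoids this is to collect the word runs with re.findall instead (words_only is just an illustrative name):

```python
import re

def words_only(text):
    # findall returns only the non-empty runs of word characters, so there
    # are no empty strings to produce stray leading or trailing spaces.
    return " ".join(re.findall(r"\w+", text))

print(words_only("AAA? BBB. CCC! DDD.EEE"))  # AAA BBB CCC DDD EEE
```

Keep in mind that \w also matches digits and the underscore, so those are kept.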
Upvotes: 2
Reputation: 82765
This is one approach using str.strip and a simple iteration.
Ex:
from string import punctuation
s = "AAA? BBB. CCC! DDD.EEE"
def cleanString(strval):
    # Strip punctuation from both ends, then turn any punctuation left inside into a space.
    return "".join(" " if i in punctuation else i for i in strval.strip(punctuation))
s = " ".join(cleanString(i) for i in s.split())
print(s)
Output:
AAA BBB CCC DDD EEE
Upvotes: 1
Reputation: 159
You can also do it like this:
punctuation = "!@#$%^&*()_+<>?:.,;" # add whatever you want
s = "AAA? BBB. CCC!"
for c in s:
    if c in punctuation:
        s = s.replace(c, "")
print(s)
Output:
AAA BBB CCC
Upvotes: 4
Reputation: 709
Check this out:
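if __name__ == "__main__":
    test_string = "AAA? BBB. CCC! DDD.EEE"
    # Replace every non-letter with a space, then collapse runs of whitespace.
    spaced = "".join((char if char.isalpha() else " ") for char in test_string)
    result = " ".join(spaced.split())
    print(result)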
Result: AAA BBB CCC DDD EEE
Upvotes: 0
Reputation: 438
Try this code:
import re
input_str = "AAA? BBB. CCC! DDD.EEE"
output_str = re.sub('[^A-Za-z0-9]+', ' ', input_str)
print(output_str)
Output:
AAA BBB CCC DDD EEE
Upvotes: 5