Reputation: 27
In a sentence, how can I remove apostrophes, double quotes, commas and similar punctuation from all words, except for the apostrophe in words like it's and what's? At the end of the sentence there must also be a space between the last word and the full stop.
For example
Input Sentence :
"'This has punctuation, and it's hard to remove. ?"
Desired Output Sentence :
This has punctuation and it's hard to remove .
Upvotes: 0
Views: 3361
Reputation: 71568
I propose this code:
import re
sentences = [""""'This has punctuation, and it's hard to remove. ?" """,
"Did you see Cress' haircut?.",
"This 'thing' hasn't a really bad habit, you know?.",
"'I bought this for $30 from Best Buy it's. What a waste of money! The ear gels are 'comfortable at first, but what's after an hour."]
for s in sentences:
    # Remove the specified characters
    new_s = re.sub(r"""["?,$!]|'(?!(?<! ')[ts])""", "", s)
    # Put a space before every dot, including the final one
    new_s = re.sub(r"\.", " .", new_s)
    print(new_s)
Output:
This has punctuation and it's hard to remove .
Did you see Cress haircut .
This thing hasn't a really bad habit you know .
I bought this for 30 from Best Buy it's . What a waste of money The ear gels are comfortable at first but what's after an hour .
The regex:
["?,$!] # Match " ? , $ or !
| # OR
' # A ' if it does not have...
(?!
(?<! ')
[ts] # t or s after it, provided it has no ` '` before the t or s
)
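As a rough sketch, the same pattern can be compiled once with re.VERBOSE so the comments live inside the regex itself. The names PUNCT and clean_sentence are made up for this illustration, and the final .strip() is an extra assumption (it just drops leftover whitespace at the ends):

```python
import re

# Same pattern as above, in verbose mode. The space inside the look-behind
# is escaped because re.VERBOSE ignores unescaped whitespace.
PUNCT = re.compile(r"""
    ["?,$!]          # any of " ? , $ !
    |                # or
    '                # an apostrophe, unless...
    (?!
        (?<!\ ')     # ...it is not preceded by a space
        [ts]         # and it is followed by t or s (it's, hasn't)
    )
""", re.VERBOSE)


def clean_sentence(sentence):
    # Hypothetical helper: remove the unwanted characters first,
    # then put a space before every full stop.
    without_punct = PUNCT.sub("", sentence)
    # Assumption: leading/trailing whitespace is not wanted in the result.
    return re.sub(r"\.", " .", without_punct).strip()
```

Note that every dot gets a space in front of it, not only the last one, exactly as in the loop above.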
Upvotes: 1
Reputation: 270
Use str.strip(chars) to remove the outer quotes,
like this:
output = chaine.strip("\"")
Be careful: inside a string literal you have to escape some characters with a backslash, such as ', " and \. Alternatively, write a quote inside the other kind of quote: "'" or '"'.
Edit: Hmm, I didn't think about the apostrophes. If they are the only remaining problem, strip everything else first, then parse the string manually with a for loop: note the index of each apostrophe you find and, if it is followed by an 's', leave it in place. Either way, you have to settle the lexical/semantic rules before coding it.
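To make the manual parse concrete, here is a minimal sketch under one possible rule: keep an apostrophe only when it sits between two letters (it's, hasn't) and drop it otherwise. That rule and the name drop_outer_apostrophes are assumptions for illustration, not something fixed by the question:

```python
def drop_outer_apostrophes(text):
    # Hypothetical rule: an apostrophe is part of a word only when it has
    # a letter on both sides; any other apostrophe is treated as a quote.
    kept = []
    for i, ch in enumerate(text):
        if ch == "'":
            prev_is_letter = i > 0 and text[i - 1].isalpha()
            next_is_letter = i + 1 < len(text) and text[i + 1].isalpha()
            if not (prev_is_letter and next_is_letter):
                continue  # quoting apostrophe: skip it
        kept.append(ch)
    return "".join(kept)
```

For example, drop_outer_apostrophes("'This' isn't Cress' hat") gives "This isn't Cress hat".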
Edit 2: If the string is always a single sentence that ends with a dot and always needs the space, then use this at the end:
chaine[:-1] + " " + chaine[-1:]
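Putting the strip idea and the final-dot idea together might look roughly like this. The helper name tidy_ends and the exact set of characters passed to strip are assumptions; strip only touches the two ends of the string, so punctuation in the middle (such as the comma) stays where it is:

```python
def tidy_ends(chaine):
    # Strip spaces, quotes and question marks from both ends only.
    chaine = chaine.strip(" \"'?")
    # If the sentence ends with a full stop, put one space in front of it.
    if chaine.endswith("."):
        chaine = chaine[:-1].rstrip() + " ."
    return chaine
```

With the question's input this returns This has punctuation, and it's hard to remove . (the inner comma is untouched).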
Upvotes: 0
Reputation: 174756
Use a negative look-behind
(?<!\w)["'?]|,(?= )
Remove the matched characters with re.sub.
And your code would be,
>>> import re
>>> s = '\"\'This has punctuation, and it\'s hard to remove. ?\" '
>>> m = re.sub(r'(?<!\w)[\"\'\?]|,(?= )', r'', s)
>>> m
"This has punctuation and it's hard to remove.  "
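The result above still lacks the space before the full stop that the question asks for. One way to add it, as a sketch only (the function name and the second substitution are my assumptions, not part of the pattern above):

```python
import re


def remove_loose_punctuation(s):
    # Remove " ' ? when not preceded by a word character,
    # and commas that are followed by a space.
    s = re.sub(r'(?<!\w)["\'?]|,(?= )', '', s)
    # Assumption: the sentence should end in " ."; normalise any
    # whitespace around the last full stop to exactly that.
    return re.sub(r'\s*\.\s*$', ' .', s)
```

Keep in mind that the look-behind leaves a closing quote directly after a word (as in thing') in place, because that quote is preceded by a word character.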
Upvotes: 2
Reputation: 63757
My take on this is to remove all punctuation that sits at either end of a word. Split the sentence into words (separated by white-space) and strip any leading or trailing punctuation from each word:
>>> import re, string
>>> st = '"\'This has punctuation, and it\'s hard to remove. ?"'
>>> ''.join(e.strip(string.punctuation) for e in re.split(r"(\s)", st))
"This has punctuation and it's hard to remove "
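Since stripping every word also eats the final full stop, a variant that remembers it and puts it back could look like the following. This is only a sketch; the helper name and the way the trailing stop is detected are assumptions:

```python
import string


def strip_word_punctuation(sentence):
    # Remember whether the sentence ends with a full stop, ignoring any
    # trailing spaces, quotes or question marks.
    ends_with_stop = sentence.rstrip(" \"'?").endswith(".")
    # Strip punctuation from both ends of every word; drop empty leftovers.
    words = (w.strip(string.punctuation) for w in sentence.split())
    cleaned = " ".join(w for w in words if w)
    return cleaned + " ." if ends_with_stop else cleaned
```

Apostrophes inside words (it's) survive because strip only removes characters from the ends of each word.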
Upvotes: 0
Reputation: 41838
Use this:
(?<)["'?:;,.]
If you also want to leave the period at the end of a line (as long as it is preceded by a space):
(?<)(?<! (?=.$))["'?:;,.]
Upvotes: 0