Reputation: 33
If I had the sentence sentence = 'There is light!' and split it with mysentence = sentence.split(), how would I get print(mysentence) to output 'There, is, light, !'? What I specifically want to do is split the sentence on all punctuation, or on just a list of selected punctuation. I have some code, but the program is recognizing the characters in the word, not the word.
out = "".join(c for c in punct1 if c not in ('!','.',':'))
out2 = "".join(c for c in punct2 if c not in ('!','.',':'))
out3 = "".join(c for c in punct3 if c not in ('!','.',':'))
How would I use this so that it recognizes the word itself rather than each character in a word? That is, the output for "Hello how are you?" should become "Hello, how, are, you, ?". Is there any way of doing this?
Upvotes: 3
Views: 2887
Reputation: 626758
You may use a \w+|[^\w\s] regex with re.findall to get those chunks:
\w+|[^\w\s]
See the regex demo
Pattern details:
\w+ - 1 or more word chars (letters, digits or underscores)
| - or
[^\w\s] - 1 char other than word / whitespace

import re
p = re.compile(r'\w+|[^\w\s]')
s = "There is light!"
print(p.findall(s))
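For the "list of selected punctuation" variant mentioned in the question, the same idea can be sketched by building the character class from the characters you care about. This is only a minimal sketch: tokenize and its punct parameter are hypothetical names, and it assumes that punctuation not listed in punct should simply be dropped.

```python
import re

def tokenize(text, punct="!.:?"):
    """Hypothetical helper: return words plus each single character listed in punct."""
    if not punct:
        # An empty character class would be invalid, so fall back to words only.
        return re.findall(r"\w+", text)
    # re.escape keeps characters such as ']' or '-' literal inside the class.
    pattern = rf"\w+|[{re.escape(punct)}]"
    return re.findall(pattern, text)

tokens = tokenize("Hello, how are you?")
print(tokens)             # ['Hello', 'how', 'are', 'you', '?']
print(", ".join(tokens))  # Hello, how, are, you, ?
```

Note that the comma after "Hello" is discarded here because it is not in punct; add it to the string if it should be kept as a token.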
NOTE: If you want to treat an underscore as punctuation, you need to use a pattern like [a-zA-Z0-9]+|[^A-Za-z0-9\s].
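To see what the underscore note changes in practice, here is a small comparison of the two patterns on the same input. The sample string and variable names are made up for illustration.

```python
import re

# \w includes the underscore, so snake_case stays in one piece.
default_pattern = re.compile(r"\w+|[^\w\s]")
# Explicit ASCII classes leave the underscore out of the "word" part.
ascii_pattern = re.compile(r"[a-zA-Z0-9]+|[^A-Za-z0-9\s]")

s = "snake_case is fun!"
default_tokens = default_pattern.findall(s)
ascii_tokens = ascii_pattern.findall(s)

print(default_tokens)  # ['snake_case', 'is', 'fun', '!']
print(ascii_tokens)    # ['snake', '_', 'case', 'is', 'fun', '!']
```

Keep in mind that the explicit ASCII classes also treat non-ASCII letters (for example é) as single non-word characters, which may or may not be what you want.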
UPDATE (after comments)
To make sure you match an apostrophe as part of the words, add (?:'\w+)* or (?:'\w+)? to the \w+ in the pattern above:
import re
p = re.compile(r"\w+(?:'\w+)*|[^\w\s]")
s = "There is light!? I'm a human"
print(p.findall(s))
See the updated demo
The (?:'\w+)* group matches zero or more occurrences of an apostrophe followed by 1+ word characters. If you use ? instead of *, it matches zero or one occurrence.
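The difference between the two quantifiers only shows up on words with more than one apostrophe. A quick sketch, with a made-up sample string:

```python
import re

# * allows any number of 'xxx suffixes after the first run of word chars.
many = re.compile(r"\w+(?:'\w+)*|[^\w\s]")
# ? allows at most one such suffix.
one = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

s = "rock'n'roll isn't dead!"
many_tokens = many.findall(s)
one_tokens = one.findall(s)

print(many_tokens)  # ["rock'n'roll", "isn't", 'dead', '!']
print(one_tokens)   # ["rock'n", "'", 'roll', "isn't", 'dead', '!']
```

With ?, the second apostrophe in rock'n'roll is left over and gets picked up by the [^\w\s] branch as a separate token.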
Upvotes: 3