Jeremy Silva

Reputation: 33

How to split sentence including punctuation

If I have sentence = 'There is light!' and split it with mysentence = sentence.split(), how can I get print(mysentence) to output 'There, is, light, !'? Specifically, I want to split the sentence so that punctuation marks become separate items, either for all punctuation or for a selected list of marks only. I have some code, but it processes the individual characters of each word rather than the whole words.

out = "".join(c for c in punct1 if c not in ('!','.',':'))
out2 = "".join(c for c in punct2 if c not in ('!','.',':'))
out3 = "".join(c for c in punct3 if c not in ('!','.',':'))

How can I do this so that whole words are kept instead of being broken into characters? For example, "Hello how are you?" should become "Hello, how, are, you, ?". Is there a way to do this?

Upvotes: 3

Views: 2887

Answers (1)

Wiktor Stribiżew

Reputation: 626758

You may use a \w+|[^\w\s] regex with re.findall to get those chunks:

\w+|[^\w\s]

Pattern details:

  • \w+ - 1 or more word chars (letters, digits or underscores)
  • | - or
  • [^\w\s] - any 1 char that is neither a word char nor whitespace (i.e. a single punctuation mark or symbol)

Python demo:

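import re
# Match either a run of word chars or one non-word, non-whitespace char
p = re.compile(r'\w+|[^\w\s]')
s = "There is light!"
print(p.findall(s))  # ['There', 'is', 'light', '!']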
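
The question also asks about keeping only a selected list of punctuation marks. A minimal sketch of one way to do that, assuming the marks are passed as a non-empty string; the helper name split_keep and its marks parameter are hypothetical, not part of the original answer. Any punctuation not listed in marks is simply dropped:

```python
import re

def split_keep(sentence, marks="!.:?"):
    # Hypothetical helper: keep words plus only the punctuation listed in `marks`.
    # re.escape guards characters that are special inside a character class.
    pattern = re.compile(r"\w+|[" + re.escape(marks) + r"]")
    return pattern.findall(sentence)

print(split_keep("Hello how are you?"))  # ['Hello', 'how', 'are', 'you', '?']
print(split_keep("Wait, what: now!"))    # ['Wait', 'what', ':', 'now', '!']
```

Note that the comma in the second sentence is discarded because it is not in the selected set.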

NOTE: If you want to treat an underscore as punctuation, you need to use a pattern like [a-zA-Z0-9]+|[^A-Za-z0-9\s].
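
As a rough illustration of that note, the sketch below applies the ASCII-only pattern to a made-up sample string, so the underscore comes out as its own item. Keep in mind that with this pattern any non-ASCII letter would also be treated as punctuation:

```python
import re

# ASCII letters/digits form words; everything else that is not whitespace,
# including the underscore, is returned as a separate one-character item.
p_ascii = re.compile(r'[a-zA-Z0-9]+|[^A-Za-z0-9\s]')
tokens = p_ascii.findall("snake_case is fun!")
print(tokens)  # ['snake', '_', 'case', 'is', 'fun', '!']
```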

UPDATE (after comments)

To make sure you match an apostrophe as part of the words, add (?:'\w+)* or (?:'\w+)? to the \w+ in the pattern above:

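import re
# A word may be followed by any number of apostrophe + word-char groups
p = re.compile(r"\w+(?:'\w+)*|[^\w\s]")
s = "There is light!? I'm a human"
print(p.findall(s))  # ['There', 'is', 'light', '!', '?', "I'm", 'a', 'human']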

The (?:'\w+)* group matches zero or more occurrences of an apostrophe followed by 1 or more word characters. If you use ? instead of *, it matches zero or one such occurrence.
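
The difference between the two quantifiers only shows up on words with more than one apostrophe. A small sketch with a made-up sample string (the variable names are illustrative):

```python
import re

star = re.compile(r"\w+(?:'\w+)*|[^\w\s]")  # any number of 'xxx groups
opt = re.compile(r"\w+(?:'\w+)?|[^\w\s]")   # at most one 'xxx group

s = "rock'n'roll rocks!"
star_tokens = star.findall(s)
opt_tokens = opt.findall(s)
print(star_tokens)  # ["rock'n'roll", 'rocks', '!']
print(opt_tokens)   # ["rock'n", "'", 'roll', 'rocks', '!']
```

With ?, the second apostrophe is no longer part of the word, so it falls through to the punctuation branch.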

Upvotes: 3
