oarfish

Reputation: 4622

How can I split at word boundaries with regexes?

I'm trying to do this:

import re
sentence = "How are you?"
print(re.split(r'\b', sentence))

The result is:

[u'How are you?']

I want something like [u'How', u'are', u'you', u'?']. How can this be achieved?

Upvotes: 7

Views: 6600

Answers (3)

Vishal Kumar Sahu

Reputation: 1396

Here is my approach to split on word boundaries:

re.split(r"\b\W\b", "How are you?") # Reprocess list to split on special characters.
# Result: ['How', 'are', 'you?']

Alternatively, use findall with word boundaries (note that this drops the punctuation):

re.findall(r"\b\w+\b", "How are you?")
# Result: ['How', 'are', 'you']
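
The "reprocess" step mentioned for the split version is not spelled out above. One possible way to do it, as a rough sketch (the second pattern `\w+|[^\w\s]` is my own assumption, not part of the original answer), is to run findall over each piece to peel trailing punctuation off:

```python
import re

# First pass: split on a single non-word character that sits between two words.
parts = re.split(r"\b\W\b", "How are you?")
print(parts)  # ['How', 'are', 'you?']

# Second pass (assumed reprocessing): separate word runs from single
# punctuation characters inside each piece.
tokens = [tok for part in parts for tok in re.findall(r"\w+|[^\w\s]", part)]
print(tokens)  # ['How', 'are', 'you', '?']
```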

Upvotes: 1

Pedro Lobito

Reputation: 98871

import re
split = re.findall(r"[\w']+|[.,!?;]", "How are you?")
print(split)

Output:

['How', 'are', 'you', '?']


Regex Explanation:

"[\w']+|[.,!?;]"

    1st Alternative: [\w']+
        [\w']+ match a single character present in the list below
            Quantifier: + Between one and unlimited times, as many times as possible, giving back as needed [greedy]
            \w match any word character [a-zA-Z0-9_] (in Python 3, \w on str patterns also matches Unicode letters and digits unless re.ASCII is set)
            ' the literal character '
    2nd Alternative: [.,!?;]
        [.,!?;] match a single character present in the list below
            .,!?; a single character in the list .,!?; literally
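
To see what the two alternatives do in practice, here is a small sketch with inputs of my own choosing (they are not from the original answer). It shows that the apostrophe keeps contractions together, and that any character outside both classes is silently skipped:

```python
import re

pattern = r"[\w']+|[.,!?;]"

# The apostrophe inside [\w'] keeps "Don't" as one token.
tokens = re.findall(pattern, "Don't stop, please!")
print(tokens)  # ["Don't", 'stop', ',', 'please', '!']

# ':' and parentheses are in neither class, so findall simply skips them.
skipped = re.findall(pattern, "Wait: (really)")
print(skipped)  # ['Wait', 'really']
```

If other punctuation should be kept, it would have to be added to the second character class.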

Upvotes: 2

Kenny Lau

Reputation: 465

Unfortunately, before Python 3.7, re.split could not split on empty matches, and \b only ever matches the empty string.

To get around this, you would need to use findall instead of split.
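
As a side note, from Python 3.7 onward re.split does split on empty matches, so the attempt from the question behaves differently there. A minimal sketch, assuming Python 3.7 or later (the whitespace filter at the end is my own addition):

```python
import re

sentence = "How are you?"

# Python 3.7+: \b is an empty match, and split now honours it.
parts = re.split(r"\b", sentence)
print(parts)  # ['', 'How', ' ', 'are', ' ', 'you', '?']

# Drop the empty and whitespace-only pieces to get the desired tokens.
tokens = [p for p in parts if p.strip()]
print(tokens)  # ['How', 'are', 'you', '?']
```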

\b matches a word boundary, which is a zero-width position rather than a character.

Away from the start and end of the string, it is equivalent to (?<=\w)(?=\W)|(?<=\W)(?=\w).
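
One way to check this is to compare the positions at which each pattern matches. In the sketch below the helper name `positions` is made up for illustration. The two patterns agree everywhere except at the string edges, where the lookbehind or lookahead has no character to inspect:

```python
import re

def positions(pattern, text):
    """Return the start offset of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

sentence = "How are you?"
boundary = positions(r"\b", sentence)
lookaround = positions(r"(?<=\w)(?=\W)|(?<=\W)(?=\w)", sentence)

print(boundary)    # [0, 3, 4, 7, 8, 11]
print(lookaround)  # [3, 4, 7, 8, 11]
```

Here \b additionally matches at offset 0 because the string starts with a word character.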

That means the following code works, although the result also contains the runs of whitespace:

import re
sentence = "How are you?"
print(re.findall(r'\w+|\W+', sentence))
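
For reference, here is what that call returns, plus one possible way to drop the whitespace so the output matches what the question asks for. The filtering step is my own sketch, not part of the answer above:

```python
import re

sentence = "How are you?"

# Alternating runs of word and non-word characters, whitespace included.
pieces = re.findall(r"\w+|\W+", sentence)
print(pieces)  # ['How', ' ', 'are', ' ', 'you', '?']

# Assumed post-processing: discard pieces that are only whitespace.
tokens = [p for p in pieces if not p.isspace()]
print(tokens)  # ['How', 'are', 'you', '?']
```

Note that a piece such as ", " (punctuation followed by a space) would survive this filter with its space attached. Matching `\w+|[^\w\s]+` instead avoids that.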

Upvotes: 15
