Reputation: 25
import re
text = 'The quick. black n brown? fox jumps*over the lazy dog.'
print(re.split('; |, |\? |. ',text))
This is giving me an output of:
['Th', 'quick', 'blac', '', 'brown', 'fo', 'jumps*ove', 'th', 'laz', 'dog.']
but I want that string to be split as
['The quick.', 'black n brown?', 'fox jumps*over the lazy dog.']
Upvotes: 1
Views: 44
Reputation: 42137
You can leverage a zero-width positive lookbehind here:
re.split('(?<=[;,.?]) ',text)
(?<=[;,.?]) is a zero-width positive lookbehind that matches any one of the characters ; , . ? literally; it is followed by a space, which is the only character the split actually consumes.
Example:
In [1461]: text = 'The quick. black n brown? fox jumps*over the lazy dog.'
In [1462]: re.split(r'(?<=[;,.?]) ',text)
Out[1462]: ['The quick.', 'black n brown?', 'fox jumps*over the lazy dog.']
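As a self-contained version of the lookbehind approach above, here is a minimal sketch. The helper name split_keep_punct is hypothetical and only there for illustration; the pattern is the one from this answer.

```python
import re

def split_keep_punct(text):
    # Split on a space only when it directly follows one of ; , . ?
    # The lookbehind checks for the punctuation without consuming it,
    # so the punctuation stays attached to the preceding piece.
    return re.split(r'(?<=[;,.?]) ', text)

text = 'The quick. black n brown? fox jumps*over the lazy dog.'
parts = split_keep_punct(text)
print(parts)
```

Spaces that do not follow one of the listed punctuation characters are left alone, which is why 'black n brown?' stays in one piece.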
In your attempt, if you replace . (which matches any character) with the escaped version \. to match a literal dot, you get closer to the desired output:
In [1463]: text = 'The quick. black n brown? fox jumps*over the lazy dog.'
In [1464]: re.split(r'; |, |\? |. ',text)
Out[1464]: ['Th', 'quick', 'blac', '', 'brown', 'fo', 'jumps*ove', 'th', 'laz', 'dog.']
In [1465]: re.split(r'; |, |\? |\. ',text)
Out[1465]: ['The quick', 'black n brown', 'fox jumps*over the lazy dog.']
As every alternative is a single character followed by a space, you can make the pattern more compact by using a character class:
In [1466]: re.split(r'[;,?.] ',text)
Out[1466]: ['The quick', 'black n brown', 'fox jumps*over the lazy dog.']
You don't need to escape most regex metacharacters, such as . and ?, inside a character class [].
Also, make regex patterns raw strings by prefixing the string literal with r, so that backslashes reach the regex engine unchanged.
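The two points above can be checked with a short sketch. The variable names are illustrative only, and the sample string is the one from the question.

```python
import re

text = 'The quick. black n brown? fox jumps*over the lazy dog.'

# Outside a character class, an unescaped dot matches any character
# (except a newline), so this splits after almost every word.
unescaped = re.split(r'. ', text)

# Inside a character class the dot is literal, same as an escaped dot.
in_class = re.split(r'[.] ', text)
escaped = re.split(r'\. ', text)

# A raw string keeps the backslash as written; the non-raw
# equivalent needs the backslash doubled.
same_pattern = (r'\. ' == '\\. ')

print(unescaped)
print(in_class)
print(escaped)
print(same_pattern)
```

Using r'...' for every pattern avoids having to reason about which backslash sequences Python itself would interpret before the regex engine sees them.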
Upvotes: 0
Reputation: 6234
If I understood what you need, your regex should have the dot escaped:
print(re.split(r'; |, |\? |\. ', text))
Upvotes: 1