mssd21

Reputation: 25

The separators are not working properly in the regular expression split method

import re
text = 'The quick. black n brown? fox jumps*over the lazy dog.'
print(re.split('; |, |\? |. ',text))

This is giving me an output of:

['Th', 'quick', 'blac', '', 'brown', 'fo', 'jumps*ove', 'th', 'laz', 'dog.']

but I want that string to be split as

['The quick.', 'black n brown?', 'fox jumps*over the lazy dog.']

Upvotes: 1

Views: 44

Answers (2)

heemayl

Reputation: 42137

You can leverage a zero-width positive lookbehind here:

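re.split(r'(?<=[;,.?]) ', text)
  • (?<=[;,.?]) is a zero-width positive lookbehind: it asserts that the preceding character is one of ;, ,, . or ? without consuming it. The pattern then matches only the space that follows, so the punctuation stays attached to each piece.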

Example:

In [1461]: text = 'The quick. black n brown? fox jumps*over the lazy dog.'

In [1462]: re.split(r'(?<=[;,.?]) ',text)
Out[1462]: ['The quick.', 'black n brown?', 'fox jumps*over the lazy dog.']
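
As a rough sketch, the same lookbehind idea can be wrapped in a small function so the separator set is easy to change. The name split_keep_punct and its seps parameter are hypothetical, made up for illustration; only re.split and re.escape come from the standard library.

```python
import re

# Hypothetical helper: the name and the default separator set are assumptions.
def split_keep_punct(text, seps=";,.?"):
    # re.escape keeps every separator literal inside the [...] class.
    char_class = "[" + re.escape(seps) + "]"
    # The lookbehind only checks the separator; just the space is consumed.
    return re.split(r"(?<=" + char_class + r") ", text)

print(split_keep_punct("one; two, three? four. five"))
# ['one;', 'two,', 'three?', 'four.', 'five']
```

Python's re module requires lookbehinds to be fixed-width, which holds here because the character class always matches exactly one character.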

In your attempt, the unescaped . matches any character, not just a period. If you replace it with the escaped version \. to get a literal ., you get closer to the desired output (the punctuation is still consumed by the split):

In [1463]: text = 'The quick. black n brown? fox jumps*over the lazy dog.'

In [1464]: re.split(r'; |, |\? |. ',text)
Out[1464]: ['Th', 'quick', 'blac', '', 'brown', 'fo', 'jumps*ove', 'th', 'laz', 'dog.']

In [1465]: re.split(r'; |, |\? |\. ',text)
Out[1465]: ['The quick', 'black n brown', 'fox jumps*over the lazy dog.']

As all the alternatives are a single character followed by a space, you can make the pattern more compact by using a character class:

In [1466]: re.split(r'[;,?.] ',text)
Out[1466]: ['The quick', 'black n brown', 'fox jumps*over the lazy dog.']

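Most regex metacharacters, including . and ?, lose their special meaning inside a character class [], so you don't need to escape them there. The exceptions are ], \, a leading ^, and a - between two characters.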
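
A small check of that point, as a sketch: the escaped and unescaped character classes should split the same way, while ] and \ still need a backslash inside the class. The second sample string is made up for illustration.

```python
import re

text = 'The quick. black n brown? fox jumps*over the lazy dog.'

# . and ? are literal inside [...], so escaping them changes nothing.
unescaped = re.split(r'[;,?.] ', text)
escaped = re.split(r'[;,\?\.] ', text)
print(unescaped == escaped)  # True

# ] and \ are exceptions: they must be escaped to be class members.
bracket_or_backslash = re.split(r'[\]\\] ', r'a] b\ c')
print(bracket_or_backslash)  # ['a', 'b', 'c']
```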

Also, write regex patterns as raw strings by prefixing the string literal with r (for example r'\? '), so that backslashes reach the regex engine unchanged.
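
To illustrate why the r prefix matters, here is a minimal sketch using \b, which is a word boundary to the regex engine but a backspace character in an ordinary Python string. The search for the word fox is only an illustrative choice.

```python
import re

text = 'The quick. black n brown? fox jumps*over the lazy dog.'

# Raw string: the regex engine receives \b and treats it as a word boundary.
raw_hits = re.findall(r'\bfox\b', text)

# Plain string: Python turns \b into a backspace character before re sees it.
plain_hits = re.findall('\bfox\b', text)

print(raw_hits)    # ['fox']
print(plain_hits)  # []
print(len(r'\b'), len('\b'))  # 2 1
```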

Upvotes: 0

Yennefer

Reputation: 6234

If I understood what you needed, your regex should have the dot escaped:

print(re.split(r'; |, |\? |\. ', text))

Upvotes: 1
