Python 3 regex: splitting sentences not working properly

Question

I have the following text:

'The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war? Is AI a bad thing?'

And I'm trying to split it into sentences using the following regex:

re.split(r'[\.\?\!][\s\n]', text.strip())

For some reason it is not deleting the last question mark. The result I get is the following:

['The first time you see The Second Renaissance it may look boring', 'Look at it at least twice and definitely watch part 2', 'It will change your view of the matrix', 'Are the human people the ones who started the war', 'Is AI a bad thing?']

I tried modifying the regex slightly by adding a "*" at the end:

re.split(r'[\.\?\!][\s\n]*', text.strip())

But this is what I get:

['The first time you see The Second Renaissance it may look boring', 'Look at it at least twice and definitely watch part 2', 'It will change your view of the matrix', 'Are the human people the ones who started the war', 'Is AI a bad thing', '']<------- See this empty string

How should I go about this? I cannot use NLTK here; I am required to use only Python 3 regex.

Thm Lee · Accepted Answer

This follows from how split() works: it divides the string at every occurrence of the separator (or delimiter). When the separator appears at the very start or end of the string, the text on one side of it is empty, so split() produces an empty string as a by-product.
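
A minimal sketch of that behavior, using a shortened sample string rather than the original text (the names `text`, `strict`, and `loose` are made up for illustration):

```python
import re

# Shortened sample, assumed for illustration only.
text = "Is AI a bad thing? Yes."

# Requiring one whitespace character after the punctuation: the final "."
# is followed by nothing, so it never matches and stays in the last item.
strict = re.split(r"[.?!]\s", text)
print(strict)  # ['Is AI a bad thing', 'Yes.']

# Making the whitespace optional: the final "." now matches, and the empty
# remainder after it becomes an extra '' element.
loose = re.split(r"[.?!]\s*", text)
print(loose)  # ['Is AI a bad thing', 'Yes', '']
```

This mirrors the two attempts in the question: the first leaves the last terminator in place, the second removes it but adds a trailing empty string.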

To avoid or remove empty strings of this kind, you can use other functions: filter() to drop the empty strings after splitting, or matching functions such as re.match() or re.findall() to capture the sentences instead of splitting, as follows.

A regex for the separator:

[\.\?\!](?:[\s]+|$)

- Use filter() to remove the empty string elements after splitting, or use re.findall() to capture everything except the separator.

import re

ss = """The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war? Is AI a bad thing?"""

# Option 1: split on the separator, then drop the empty string left at the end.
splt = re.split(r"[\.\?\!](?:[\s]+|$)", ss)
splt = list(filter(None, splt))
print(splt)

# Option 2: capture each sentence (group 1) up to, but not including, the separator.
regs = re.compile(r'((?:(?![\.\?\!](?:[\s]+|$)).)*)[\.\?\!](?:[\s]+|$)')
match = regs.findall(ss)
print(match)

The capturing regex used in findall():

((?:(?![\.\?\!](?:[\s]+|$)).)*)[\.\?\!](?:[\s]+|$)
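
To make the pieces of that pattern easier to read, here is a sketch of the same regex written with re.VERBOSE so each part can carry a comment. The escapes inside the character class are dropped because they are not needed there, and the names `sentence_re`, `sample`, and `parts` are made up for illustration:

```python
import re

# Same logic as the pattern above, spread over several lines with re.VERBOSE.
sentence_re = re.compile(r"""
    (                        # group 1: the sentence body
      (?:
        (?![.?!](?:\s+|$))   # stop before punctuation followed by whitespace or end
        .                    # otherwise consume one character
      )*
    )
    [.?!]                    # the terminating punctuation, not captured
    (?:\s+|$)                # trailing whitespace, or end of the string
""", re.VERBOSE)

# Sample text assumed for illustration; the "." inside "v1.0" is not
# followed by whitespace, so it does not end a sentence.
sample = "Watch part 2. Is v1.0 bad?"
parts = sentence_re.findall(sample)
print(parts)  # ['Watch part 2', 'Is v1.0 bad']
```

Because the pattern has exactly one capturing group, findall() returns only the sentence bodies and leaves out the terminators.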

The script prints:

['The first time you see The Second Renaissance it may look boring', 'Look at it at least twice and definitely watch part 2', 'It will change your view of the matrix', 'Are the human people the ones who started the war', 'Is AI a bad thing']
['The first time you see The Second Renaissance it may look boring', 'Look at it at least twice and definitely watch part 2', 'It will change your view of the matrix', 'Are the human people the ones who started the war', 'Is AI a bad thing']
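
If this is needed in more than one place, the split-and-filter approach could be wrapped in a small helper. This is only a sketch: `split_sentences` and `_SEPARATOR` are hypothetical names, and the helper assumes that dropping every empty piece is acceptable.

```python
import re

# Separator from the answer: punctuation followed by whitespace or end of string.
_SEPARATOR = re.compile(r"[.?!](?:\s+|$)")


def split_sentences(text):
    """Split text into sentences, dropping terminators and empty pieces."""
    return [piece for piece in _SEPARATOR.split(text.strip()) if piece]


print(split_sentences("It will change your view. Is AI a bad thing?  "))
# ['It will change your view', 'Is AI a bad thing']
```

The list comprehension does the same job as list(filter(None, ...)) in the script above.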
