Reputation: 642
I have the following text:
'The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war? Is AI a bad thing?'
And I'm trying to split it into sentences using the following regex:
re.split(r'[\.\?\!][\s\n]', text.strip())
For some reason it is not deleting the last question mark. The result I get is the following:
['The first time you see The Second Renaissance it may look boring', 'Look at it at least twice and definitely watch part 2', 'It will change your view of the matrix', 'Are the human people the ones who started the war', 'Is AI a bad thing?']
I tried modifying the regex slightly by adding a "*" at the end:
re.split(r'[\.\?\!][\s\n]*', text.strip())
But this is what I get:
['The first time you see The Second Renaissance it may look boring', 'Look at it at least twice and definitely watch part 2', 'It will change your view of the matrix', 'Are the human people the ones who started the war', 'Is AI a bad thing', '']<------- See this empty string
How should I go about this? I cannot use NLTK here; I am required to use only Python 3 regex.
Upvotes: 1
Views: 103
Reputation: 1236
This comes from the nature of split(), which divides a string at each separator (or delimiter). When a separator appears at the very start or end of the string, the split produces an empty string on that side as a by-product.
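A minimal sketch of this behavior, using short made-up strings rather than the text from the question:

```python
import re

# A separator at the very start or end leaves an empty string on that side.
edges = re.split(r",", ",a,b,")
print(edges)  # ['', 'a', 'b', '']

# The same thing happens when the final "?" is consumed as a separator.
parts = re.split(r"[.?!]\s*", "One. Two?")
print(parts)  # ['One', 'Two', '']
```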
To avoid or remove empty strings of this kind, you can use other functions: filter() to drop the empty strings after splitting, or re.match() and re.findall() to capture the sentences directly, as follows.
A regex for the separator:
[\.\?\!](?:[\s]+|$)
- Use filter() to remove the empty string elements after splitting, or use re.findall() to capture the strings without the separator.
import re
ss = """The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war? Is AI a bad thing?"""
# Option 1: split on the separator, then drop the empty string left after the final "?"
splt = re.split(r"[\.\?\!](?:[\s]+|$)", ss)
splt = list(filter(None, splt))
print(splt)
# Option 2: capture everything up to each separator instead of splitting
regs = re.compile(r'((?:(?![\.\?\!](?:[\s]+|$)).)*)[\.\?\!](?:[\s]+|$)')
match = regs.findall(ss)
print(match)
Demo for the capturing regex used in findall():
((?:(?![\.\?\!](?:[\s]+|$)).)*)[\.\?\!](?:[\s]+|$)
The script's output is:
['The first time you see The Second Renaissance it may look boring', 'Look at it at least twice and definitely watch part 2', 'It will change your view of the matrix', 'Are the human people the ones who started the war', 'Is AI a bad thing']
['The first time you see The Second Renaissance it may look boring', 'Look at it at least twice and definitely watch part 2', 'It will change your view of the matrix', 'Are the human people the ones who started the war', 'Is AI a bad thing']
Upvotes: 1
Reputation: 10360
You are getting a blank last element because your regex [\.\?\!][\s\n]* matches the last ?, so the split is performed on that ? as well. That gives you two strings: one to the left of that ? and one to the right. The string to the right of the last ? is empty, hence the blank last element of the list.
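To see which separators each pattern actually matches, one option is to list the match spans with re.finditer(). This is only an illustrative sketch on a short made-up string; the variable names are not from the question.

```python
import re

text = "A? B?"

# Without "*", the final "?" has no whitespace after it, so it is not matched.
strict = [m.span() for m in re.finditer(r"[.?!]\s", text)]
print(strict)  # [(1, 3)]

# With "*", the final "?" is matched too, and that match ends at the end of the string.
loose = [m.span() for m in re.finditer(r"[.?!]\s*", text)]
print(loose)   # [(1, 3), (4, 5)]
```

Because the last match ends exactly at len(text), re.split() has nothing left to put after it and emits an empty string.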
Instead of splitting, you can get the matches by using the following regex:
[^.?!]+
See the Python Code output here
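As a rough sketch of how that pattern might be applied with re.findall() (the sample text is shortened and the strip step is an addition, not part of the regex above): each match after the first starts with the space that followed the previous sentence, so the pieces are stripped, and whitespace-only pieces are skipped.

```python
import re

text = "It will change your view. Are humans to blame? Is AI a bad thing?"

# [^.?!]+ grabs each run of characters that is not sentence-ending punctuation.
raw = re.findall(r"[^.?!]+", text)

# Strip the leading space on later sentences and skip whitespace-only pieces.
sentences = [s.strip() for s in raw if s.strip()]
print(sentences)
# ['It will change your view', 'Are humans to blame', 'Is AI a bad thing']
```

Like the split-based approaches, this treats every ".", "?" or "!" as a sentence end, so abbreviations and decimal numbers would be cut as well.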
Upvotes: 1