Reputation: 31
A sentence will be a sequence of characters that:
I have a file that contains the following text:
this is the\nfirst sentence. Isn't\nit? Yes ! !! This \n\nlast bit :) is also a sentence, but \nwithout a terminator other than the end of the file\n
By the above definition, there are four "sentences" in it:
this is the\nfirst sentence
Isn't\nit
Yes
This \n\nlast bit :) is also a sentence, but \nwithout a terminator other than the end of the file
Notice that:
This is what I have at the moment: (.*\n+)
and I don't know how to refine it.
I need a regex that splits the text into the sentences above and returns them as a list. Thank you in advance for your help.
Upvotes: 1
Views: 76
Reputation: 4554
Try this:
re.split(r'!\s!+|\.|\?', s)
['this is the\nfirst sentence', " Isn't\nit", ' Yes ', ' This \n\nlast bit :) is also a sentence, but \nwithout a terminator other than the end of the file\n']
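One thing to watch with re.split: when the pattern contains a capturing group, the captured text (or None when the group did not take part in the match) is inserted into the result list. Below is a minimal sketch that avoids groups entirely. It assumes the terminators are ., ? and !, and that leading and trailing whitespace should be trimmed, both inferred from the expected output in the question. The names parts and sentences are illustrative only.

```python
import re

s = ("this is the\nfirst sentence. Isn't\nit? Yes ! !! This \n\nlast bit :) "
     "is also a sentence, but \nwithout a terminator other than the end of "
     "the file\n")

# A capturing group in the pattern leaks into the result of re.split:
demo = re.split(r"(!)|\.", "a.b!c")
# demo == ['a', None, 'b', '!', 'c']

# Assumed delimiter: one terminator followed by any run of further
# terminators and whitespace. No groups, so only the pieces are returned.
parts = re.split(r"[.?!][.?!\s]*", s)

# Trim surrounding whitespace and drop empty pieces (for example, the empty
# string produced when the text ends with a terminator).
sentences = [p.strip() for p in parts if p.strip()]

for sentence in sentences:
    print(repr(sentence))
```

Whitespace inside a sentence, including the line breaks, is left untouched; only the edges are trimmed.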
Upvotes: 0
Reputation: 626691
The following is not something that will work for everyone, but it works for your specific input. You may further tweak this expression:
([^!?.]+)[!?.\s]*(?![!?.])
See the regex demo.
Details:
([^!?.]+) - Capturing group 1 matching 1 or more chars other than !, ? and .
[!?.\s]* - 0 or more !, ?, . or whitespace chars
(?![!?.]) - that are not followed with !, ? or .

In Python, you need to use it with re.findall, which will only fetch the substrings captured with capturing groups:
import re
rx = r"([^!?.]+)[!?.\s]*(?![!?.])"
s = "this is the\nfirst sentence. Isn't\nit? Yes ! !! This \n\nlast bit :) is also a sentence, but \nwithout a terminator other than the end of the file\n"
sents = re.findall(rx, s)
print(sents)
# => ['this is the\nfirst sentence',
#     "Isn't\nit",
#     'Yes ',
#     'This \n\nlast bit :) is also a sentence, but \nwithout a terminator other than the end of the file\n']
See Python demo
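The raw findall result still carries trailing whitespace ('Yes ' and the final '\n'), while the expected sentences in the question do not. A small sketch of one way to tidy that up, assuming trimming the edges is acceptable; split_sentences and SENTENCE_RX are hypothetical names introduced here, not part of the re module.

```python
import re

# Same pattern as above, compiled once for reuse.
SENTENCE_RX = re.compile(r"([^!?.]+)[!?.\s]*(?![!?.])")


def split_sentences(text):
    """Return the group-1 matches, trimmed, skipping whitespace-only ones."""
    return [m.strip() for m in SENTENCE_RX.findall(text) if m.strip()]


s = ("this is the\nfirst sentence. Isn't\nit? Yes ! !! This \n\nlast bit :) "
     "is also a sentence, but \nwithout a terminator other than the end of "
     "the file\n")

for sentence in split_sentences(s):
    print(repr(sentence))
```

With a single capturing group, re.findall returns a flat list of strings, so the list comprehension can strip each item directly.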
Upvotes: 1