Reputation: 31

Regex construction to obtain sentences from a text - Python

A sentence will be a sequence of characters that:

is terminated by (but does not include) the characters ! ? . or the end of the file
excludes whitespace on either end, and
is not empty

I have a file that contains the following text:

this is the\nfirst sentence. Isn't\nit? Yes ! !! This \n\nlast bit :) is also a sentence, but \nwithout a terminator other than the end of the file\n

By the above definition, there are four "sentences" in it:

Sentence 1: this is the\nfirst sentence
Sentence 2: Isn't\nit
Sentence 3: Yes
Sentence 4: This \n\nlast bit :) is also a sentence, but \nwithout a terminator other than the end of the file

Notice that:

The sentences do not include their terminator character.
The last sentence was not terminated by a character; it finishes with the end of the file.
Sentences can span multiple lines of the file.

This is what I have at the moment, (.*\n+), and I don't know how to refine it.

I need help with a regex that splits the text into the sentences above and returns them as a list. Thank you in advance for your help.

Upvotes: 1

Answers (2)

Shenglin Chen

Reputation: 4554

Try this:

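import re
# s holds the text from the question. The group is non-capturing so that re.split
# does not insert the group's match (or None) into the resulting list.
re.split(r'(?:!\s!+)|\.|\?', s)
['this is the\nfirst sentence', " Isn't\nit", ' Yes ', ' This \n\nlast bit :) is also a sentence, but \nwithout a terminator other than the end of the file\n']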
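
The pieces above still carry leading and trailing whitespace, and the `!\s!+` alternative is tailored to the exact `! !!` run in the sample. If the question's full definition is wanted (no surrounding whitespace, no empty sentences), one possible sketch is to split at every single terminator and clean up afterwards. The names `parts` and `sentences` are illustrative only:

```python
import re

# Sample text from the question.
s = ("this is the\nfirst sentence. Isn't\nit? Yes ! !! This \n\n"
     "last bit :) is also a sentence, but \n"
     "without a terminator other than the end of the file\n")

# Split at each terminator character; runs such as "! !!" produce
# empty or whitespace-only pieces, which are filtered out below.
parts = re.split(r"[.!?]", s)

# Trim whitespace on both ends and keep only non-empty pieces.
sentences = [p.strip() for p in parts if p.strip()]

print(sentences)
```

For the sample text this should leave exactly the four sentences listed in the question, with the inner newlines preserved.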

Upvotes: 0

Wiktor Stribiżew

Reputation: 626691

The following is not something that will work for everyone, but it works for your specific input. You may further tweak this expression:

([^!?.]+)[!?.\s]*(?![!?.])

See the regex demo.

Details:

([^!?.]+) - Capturing group 1, matching one or more chars other than !, ? and .
[!?.\s]* - zero or more !, ?, . or whitespace chars
(?![!?.]) - a negative lookahead that fails the match if the next char is !, ? or .

In Python, use it with re.findall, which returns only the captured substrings when the pattern contains a capturing group:

import re
rx = r"([^!?.]+)[!?.\s]*(?![!?.])"
s = "this is the\nfirst sentence. Isn't\nit? Yes ! !! This \n\nlast bit :) is also a sentence, but \nwithout a terminator other than the end of the file\n"
sents = re.findall(rx, s)
print(sents)
# => ['this is the\nfirst sentence',
#     "Isn't\nit",
#     'Yes ',
#     'This \n\nlast bit :) is also a sentence, but \nwithout a terminator other than the end of the file\n'
#    ]

See Python demo
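
Note that 'Yes ' and the last item still end with whitespace, which the question's definition excludes. Calling .strip() on each result fixes that; alternatively, the pattern itself can be made to start and end on a non-whitespace character. The following is only a sketch along those lines, and the name `rx_trimmed` is made up for illustration:

```python
import re

# Sample text from the question.
s = ("this is the\nfirst sentence. Isn't\nit? Yes ! !! This \n\n"
     "last bit :) is also a sentence, but \n"
     "without a terminator other than the end of the file\n")

# First char: neither whitespace nor a terminator.
# Optional tail: any non-terminator chars, ending on a char that is
# neither whitespace nor a terminator, so matches are already trimmed.
rx_trimmed = re.compile(r"[^\s.!?](?:[^.!?]*[^\s.!?])?")

# The pattern has no capturing group, so findall returns whole matches.
sentences = rx_trimmed.findall(s)
print(sentences)
```

Because every match must contain at least one non-whitespace, non-terminator character, empty sentences cannot appear in the result.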

Upvotes: 1
