Reputation: 195
I want to split '10.1 This is a sentence. Another sentence.'
as ['10.1 This is a sentence', 'Another sentence']
and split '10.1. This is a sentence. Another sentence.'
as ['10.1. This is a sentence', 'Another sentence']
I have tried
s.split(r'\D.\D')
It doesn't work. How can this be solved?
Upvotes: 4
Views: 780
Reputation: 5541
All sentences (except the very last one) end with a period followed by a space, so split on that. Worrying about the clause number is backwards. You could potentially find all kinds of situations that you DON'T want, but it is generally much easier to describe the situation that you DO want. In this case '. ' is that situation.
import re

doc = '10.1 This is a sentence. Another sentence.'

def sentences(doc):
    # split all sentences on a period followed by whitespace
    s = re.split(r'\.\s+', doc)
    # remove empty index or remove period from absolute last index, if present
    if s[-1] == '':
        s = s[0:-1]
    elif s[-1].endswith('.'):
        s[-1] = s[-1][:-1]
    # return sentences
    return s

print(sentences(doc))
The way I structured my regex, it should also eliminate arbitrary whitespace between paragraphs.
Upvotes: 0
Reputation: 627607
If you plan to split a string on a . char that is neither preceded nor followed by a digit, and that is not at the end of the string, a splitting approach might work for you:
re.split(r'(?<!\d)\.(?!\d|$)', text)
See the regex demo.
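As a minimal sketch, here is how that split behaves on the two strings from the question. The helper name `split_sentences` is made up for illustration, and trimming the whitespace and the final period is an extra clean-up step that is not part of the regex itself:

```python
import re

# Split on a "." that is neither preceded nor followed by a digit
# and that is not the last character of the string.
SPLIT_RE = re.compile(r'(?<!\d)\.(?!\d|$)')

def split_sentences(text):
    # Hypothetical helper: split, then trim surrounding whitespace
    # and the period left on the last sentence.
    return [part.strip().rstrip('.') for part in SPLIT_RE.split(text)]

print(split_sentences('10.1 This is a sentence. Another sentence.'))
# ['10.1 This is a sentence', 'Another sentence']
print(split_sentences('10.1. This is a sentence. Another sentence.'))
# ['10.1. This is a sentence', 'Another sentence']
```

Without the clean-up step, the second item keeps its leading space and trailing period, because the regex deliberately does not split on the final period.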
If your strings can contain more special cases, you could use a more customizable extracting approach:
re.findall(r'(?:\d+(?:\.\d+)*\.?|[^.])+', text)
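A small sketch of the extracting approach on the question's inputs. The name `extract_sentences` is hypothetical, and the `strip()` call is an added assumption to drop the space that follows each period:

```python
import re

# Each match is a run of either dotted numbers (e.g. "10.1" or "10.1.")
# or single characters other than a period.
SENTENCE_RE = re.compile(r'(?:\d+(?:\.\d+)*\.?|[^.])+')

def extract_sentences(text):
    # Hypothetical helper: collect the matched chunks and trim whitespace.
    return [chunk.strip() for chunk in SENTENCE_RE.findall(text)]

print(extract_sentences('10.1 This is a sentence. Another sentence.'))
# ['10.1 This is a sentence', 'Another sentence']
print(extract_sentences('10.1. This is a sentence. Another sentence.'))
# ['10.1. This is a sentence', 'Another sentence']
```

Since the sentence-ending periods are never part of a match, no extra step is needed to remove them.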
See this regex demo. Details:
(?:\d+(?:\.\d+)*\.?|[^.])+ - a non-capturing group that matches one or more occurrences of
    \d+(?:\.\d+)*\.? - one or more digits (\d+), then zero or more sequences of a . and one or more digits ((?:\.\d+)*), and then an optional . char (\.?)
    | - or
    [^.] - any char other than a . char.
Upvotes: 1
Reputation: 26991
You have multiple issues:
- You're not using re.split(), you're using str.split().
- You haven't escaped the ., use \. instead.
Fixed code:
>>> import re
>>> s = '10.1 This is a sentence. Another sentence.'
>>> re.split(r"(?<=\D\.)(?=\D)", s)
['10.1 This is a sentence.', ' Another sentence.']
Basically, (?<=\D\.) finds a position right after a . that is itself preceded by a non-digit character. (?=\D) then makes sure there's a non-digit after the current position. When both conditions hold, the string is split at that position.
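The split above keeps the periods and the leading space. A rough sketch of one way to get the exact lists asked for in the question follows; the helper name `split_and_clean` is invented for this example, and it assumes Python 3.7 or later, where re.split() can split on an empty match:

```python
import re

def split_and_clean(text):
    # Split at the zero-width position after a non-digit followed by "."
    # and before another non-digit character (needs Python 3.7+).
    parts = re.split(r"(?<=\D\.)(?=\D)", text)
    # Extra clean-up, not part of the regex: drop surrounding whitespace
    # and the period at the end of each sentence.
    return [p.strip().rstrip('.') for p in parts]

print(split_and_clean('10.1 This is a sentence. Another sentence.'))
# ['10.1 This is a sentence', 'Another sentence']
print(split_and_clean('10.1. This is a sentence. Another sentence.'))
# ['10.1. This is a sentence', 'Another sentence']
```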
Upvotes: -1