Yadnesh Salvi

Reputation: 195

Split by '.' when not preceded by digit

I want to split '10.1 This is a sentence. Another sentence.' into ['10.1 This is a sentence', 'Another sentence'], and '10.1. This is a sentence. Another sentence.' into ['10.1. This is a sentence', 'Another sentence'].

I have tried

s.split(r'\D.\D')

It doesn't work. How can this be solved?

Upvotes: 4

Views: 780

Answers (3)

OysterShucker

Reputation: 5541

All sentences (except the very last one) end with a period followed by a space, so split on that. Worrying about the clause number is backwards: there are all kinds of situations that you DON'T want to split on, and it is generally much easier to describe the one situation that you DO want. In this case, that situation is '. '.

import re

doc = '10.1 This is a sentence. Another sentence.'

def sentences(doc):
    # Split on a period followed by one or more whitespace characters
    s = re.split(r'\.\s+', doc)

    # Drop an empty last element, or strip the period from the final sentence
    if s[-1] == '':
        s = s[:-1]
    elif s[-1].endswith('.'):
        s[-1] = s[-1][:-1]

    return s

print(sentences(doc))

Because the regex uses \s+ after the period, it should also swallow arbitrary whitespace (extra spaces or blank lines) between sentences and paragraphs.
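To make the whitespace claim concrete, here is a minimal sketch that runs the same pattern on a string with a double space and a blank line. The function name split_sentences and the sample strings are made up for illustration; only the regex comes from the answer above.

```python
import re

def split_sentences(doc):
    # Split on a period followed by one or more whitespace characters.
    parts = re.split(r'\.\s+', doc)
    # Drop an empty last element, or strip the period from the final sentence.
    if parts[-1] == '':
        parts = parts[:-1]
    elif parts[-1].endswith('.'):
        parts[-1] = parts[-1][:-1]
    return parts

# Hypothetical sample: a double space and a blank line between sentences.
multi = 'First sentence.  Second sentence.\n\n10.1 Third sentence.'
print(split_sentences(multi))
# ['First sentence', 'Second sentence', '10.1 Third sentence']

# The second input from the question, where the clause number ends with a period.
numbered = '10.1. This is a sentence. Another sentence.'
print(split_sentences(numbered))
# ['10.1', 'This is a sentence', 'Another sentence']
```

Note that in the second call the clause number is followed by '. ' as well, so it ends up as its own element. If inputs like that matter, the pattern would need a lookbehind for digits.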

Upvotes: 0

Wiktor Stribiżew

Reputation: 627607

If you want to split a string on a . char that is neither preceded nor followed by a digit, and that is not at the end of the string, a splitting approach might work for you:

re.split(r'(?<!\d)\.(?!\d|$)', text)

See the regex demo.
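A small sketch of the splitting approach applied to both strings from the question. The helper name split_on_sentence_dots and the whitespace stripping are additions for illustration, not part of the pattern itself.

```python
import re

# Split on a '.' that is not preceded by a digit and is not followed
# by a digit or by the end of the string.
SPLIT_PATTERN = re.compile(r'(?<!\d)\.(?!\d|$)')

def split_on_sentence_dots(text):
    # Strip the space that is left at the start of each later piece.
    return [part.strip() for part in SPLIT_PATTERN.split(text)]

print(split_on_sentence_dots('10.1 This is a sentence. Another sentence.'))
# ['10.1 This is a sentence', 'Another sentence.']
print(split_on_sentence_dots('10.1. This is a sentence. Another sentence.'))
# ['10.1. This is a sentence', 'Another sentence.']
```

The final period stays on the last piece because the pattern deliberately skips a . at the end of the string; calling rstrip('.') on the last element would remove it if needed.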

If your strings can contain more special cases, you could use a more customizable extracting approach:

re.findall(r'(?:\d+(?:\.\d+)*\.?|[^.])+', text)

See this regex demo. Details:

  • (?:\d+(?:\.\d+)*\.?|[^.])+ - a non-capturing group that matches one or more occurrences of
    • \d+(?:\.\d+)*\.? - one or more digits (\d+), then zero or more sequences of . and one or more digits ((?:\.\d+)*) and then an optional . char (\.?)
    • | - or
    • [^.] - any char other than a . char.
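The extracting approach can be sketched the same way. The helper name extract_sentences, the stripping, and the filtering of whitespace-only matches are assumptions added here for illustration.

```python
import re

# Match runs made of either a dotted number (with an optional trailing '.')
# or any character other than '.'.
EXTRACT_PATTERN = re.compile(r'(?:\d+(?:\.\d+)*\.?|[^.])+')

def extract_sentences(text):
    # findall returns whole matches here because the pattern has no capturing groups.
    matches = EXTRACT_PATTERN.findall(text)
    # Trim surrounding whitespace and skip matches that were only whitespace.
    return [m.strip() for m in matches if m.strip()]

print(extract_sentences('10.1 This is a sentence. Another sentence.'))
# ['10.1 This is a sentence', 'Another sentence']
print(extract_sentences('10.1. This is a sentence. Another sentence.'))
# ['10.1. This is a sentence', 'Another sentence']
```

Since the sentence-ending periods are never part of a match, no extra cleanup of the last element is needed with this variant.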

Upvotes: 1

Bharel

Reputation: 26991

You have multiple issues:

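  1. You're using str.split(), not re.split(), so the pattern is treated as a literal separator.
  2. You haven't escaped the .; use \. to match a literal period.
  3. You're not using lookahead and lookbehind assertions, so the three characters the pattern matches are removed from the result.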
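A quick sketch that reproduces these issues on the string from the question. The variable names are made up for illustration, and the second call already escapes the period so that only the missing lookarounds are left to observe.

```python
import re

s = '10.1 This is a sentence. Another sentence.'

# str.split() looks for the literal text '\D.\D', which never occurs,
# so the whole string comes back as a single element.
literal_result = s.split(r'\D.\D')
print(literal_result)
# ['10.1 This is a sentence. Another sentence.']

# re.split() with an escaped period but no lookarounds: the characters
# matched by \D on both sides of the period are consumed by the split.
consumed_result = re.split(r'\D\.\D', s)
print(consumed_result)
# ['10.1 This is a sentenc', 'Another sentence.']
```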

Fixed code:

>>> import re
>>> s = '10.1 This is a sentence. Another sentence.'
>>> re.split(r"(?<=\D\.)(?=\D)", s)
['10.1 This is a sentence.', ' Another sentence.']

Basically, (?<=\D\.) matches the position right after a . that is itself preceded by a non-digit character, and (?=\D) then makes sure a non-digit character follows that position. Both are zero-width assertions, so the split happens at that position and no characters are consumed.
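To get the lists the question asks for, the pieces can be cleaned up afterwards. This is a minimal sketch; the helper name split_at_boundaries and the strip/rstrip cleanup are additions for illustration. Splitting on a pattern that matches an empty string requires Python 3.7 or later.

```python
import re

# Zero-width boundary: just after a '.' that follows a non-digit,
# and just before another non-digit character.
BOUNDARY = re.compile(r'(?<=\D\.)(?=\D)')

def split_at_boundaries(text):
    # Remove surrounding whitespace, then the trailing period of each piece.
    return [part.strip().rstrip('.') for part in BOUNDARY.split(text)]

print(split_at_boundaries('10.1 This is a sentence. Another sentence.'))
# ['10.1 This is a sentence', 'Another sentence']
print(split_at_boundaries('10.1. This is a sentence. Another sentence.'))
# ['10.1. This is a sentence', 'Another sentence']
```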

Upvotes: -1
