Sirius

Reputation: 736

Splitting paragraphs containing abbreviations in Python using a regular expression

I tried using this function on a paragraph consisting of 3 sentences and some abbreviations.

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''
    import re
    # to split by multiple characters,
    # regular expressions are easiest (and fastest)
    sentenceEnders = re.compile('[.!?][\s]{1,2}[A-Z]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList

if __name__ == '__main__':
    p = "While other species (e.g. horse mango, M. foetida) are also grown ,Mangifera indica – the common mango or Indian mango – is the only mango tree. Commonly cultivated in many tropical and subtropical regions, and its fruit is distributed essentially worldwide.In several cultures, its fruit and leaves are ritually used as floral decorations at weddings, public celebrations and religious "

    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print s.strip()

The first character of the next sentence is eliminated.

Output received:
 While other Mangifera species (e.g. horse mango, M. foetida) are also grown on a
 more localized basis, Mangifera indica ΓÇô the common mango or Indian mango ΓÇô
 is the only mango tree
ommonly cultivated in many tropical and subtropical regions, and its fruit is di
stributed essentially worldwide.In several cultures, its fruit and leaves are ri
tually used as floral decorations at weddings, public celebrations and religious.

Thus the string was split into only 2 strings, and the first character of the next sentence was eliminated. Also, some strange characters can be seen; I guess Python wasn't able to convert the hyphen.

In case I alter the regex to [.!?][\s]{1,2}, I get:

While other species (e.g
horse mango, M
foetida) are also grown ,Mangifera indica ΓÇô the common mango or Indian mango Γ
Çô is the only mango tree
Commonly cultivated in many tropical and subtropical regions, and its fruit is d
istributed essentially worldwide.In several cultures, its fruit and leaves are r
itually used as floral decorations at weddings, public celebrations and religiou
s

Thus even the abbreviations get split.

Upvotes: 0

Views: 1963

Answers (1)

agf

Reputation: 176800

The regex you want is:

[.!?][\s]{1,2}(?=[A-Z])

You want a positive lookahead assertion: the pattern matches only if it is followed by a capital letter, but the capital letter itself is not part of the match, so split() does not remove it.
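
A minimal sketch of the difference, written in Python 3 syntax. The sample string is a shortened, hypothetical stand-in for the paragraph in the question, and the function name is only illustrative:

```python
import re

# Pattern from the question: the capital letter is part of the match,
# so split() removes it along with the punctuation and whitespace.
CONSUMING = re.compile(r'[.!?]\s{1,2}[A-Z]')

# Lookahead version: the capital letter must follow, but is not consumed.
LOOKAHEAD = re.compile(r'[.!?]\s{1,2}(?=[A-Z])')


def split_paragraph_into_sentences(paragraph, pattern=LOOKAHEAD):
    """Split a paragraph wherever the given pattern matches."""
    return pattern.split(paragraph)


# Hypothetical sample text, shortened from the question's paragraph.
sample = "Other species (e.g. horse mango) are grown. Commonly cultivated here."

broken = split_paragraph_into_sentences(sample, CONSUMING)
sentences = split_paragraph_into_sentences(sample)

print(broken)     # second item starts with "ommonly"
print(sentences)  # second item starts with "Commonly"
```

Note that "e.g. horse" is not split in either case, because the letter after the space is lowercase.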

Only the first sentence boundary was matched because there is no space after the second period ("worldwide.In"), and the pattern requires one or two whitespace characters there.
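
The missing space can be checked on a fragment of the question's text. The relaxed pattern below is only an assumption about what might be acceptable for your input, not part of the regex above: it makes the whitespace optional, and as a side effect it would also split inside strings such as "U.S.A.".

```python
import re

# Requires 1-2 whitespace characters after the punctuation.
strict = re.compile(r'[.!?]\s{1,2}(?=[A-Z])')
# Assumption: whitespace after the punctuation is optional (0-2 characters).
relaxed = re.compile(r'[.!?]\s{0,2}(?=[A-Z])')

# Fragment of the question's paragraph, with no space after the period.
text = "its fruit is distributed essentially worldwide.In several cultures, its fruit"

strict_parts = strict.split(text)
relaxed_parts = relaxed.split(text)

print(len(strict_parts))  # 1: no split, since "." is followed directly by "I"
print(relaxed_parts)      # two parts, split between "worldwide" and "In"
```

If the input can be corrected instead, adding the missing space after the period keeps the stricter pattern usable.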

Upvotes: 2
