Henrik Petterson

Reputation: 7094

Split string into sentences using regex

I have random text stored in $sentences. Using regex, I want to split the text into sentences, see:

function splitSentences($text) {
    $re = '/                # Split sentences on whitespace between them.
        (?<=                # Begin positive lookbehind.
          [.!?]             # Either an end of sentence punct,
        | [.!?][\'"]        # or end of sentence punct and quote.
        )                   # End positive lookbehind.
        (?<!                # Begin negative lookbehind.
          Mr\.              # Skip either "Mr."
        | Mrs\.             # or "Mrs.",
        | T\.V\.A\.         # or "T.V.A.",
                            # or... (you get the idea).
        )                   # End negative lookbehind.
        \s+                 # Split on whitespace between sentences.
        /ix';

    $sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
    return $sentences;
}

$sentences = splitSentences($sentences);

print_r($sentences);

It works fine.

However, it doesn't split into sentences if there are unicode characters:

$sentences = 'Entertainment media properties. Fairy Tail and Tokyo Ghoul.';

Or this scenario:

$sentences = "Entertainment media properties.&Acirc;&nbsp; Fairy Tail and Tokyo Ghoul.";

What can I do to make it work when unicode characters exist in the text?

Here is an ideone for testing.

Bounty info

I am looking for a complete solution to this. Before posting an answer, please read the comment thread I had with WiktorStribiżew for more relevant info on this issue.

Upvotes: 24

Views: 3631

Answers (7)

Jonathan Rowley

Reputation: 21

I know this question is old and has been nicely answered by @ndnenkov, but I figured I could clean up the PHP and make it more efficient, since it was really slow on large bodies of text.

Here are my updates:

function sentence_split($text) {
    // put regex tests into an easier to read array
    $regexes = array(
        [
            "is_sentence_boundary"=>false,
            "before"=>'/(?:(?:[\'\"„][\.!?…][\'\"”]\s)|(?:[^\.]\s[A-Z]\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s[A-Z]\.\s)|(?:\bApr\.\s)|(?:\bAug\.\s)|(?:\bBros\.\s)|(?:\bCo\.\s)|(?:\bCorp\.\s)|(?:\bDec\.\s)|(?:\bDist\.\s)|(?:\bFeb\.\s)|(?:\bInc\.\s)|(?:\bJan\.\s)|(?:\bJul\.\s)|(?:\bJun\.\s)|(?:\bMar\.\s)|(?:\bNov\.\s)|(?:\bOct\.\s)|(?:\bPh\.?D\.\s)|(?:\bSept?\.\s)|(?:\b\p{Lu}\.\p{Lu}\.\s)|(?:\b\p{Lu}\.\s\p{Lu}\.\s)|(?:\bcf\.\s)|(?:\be\.g\.\s)|(?:\besp\.\s)|(?:\bet\b\s\bal\.\s)|(?:\bvs\.\s)|(?:\p{Ps}[!?]+\p{Pe} ))\Z/su',
            "after"=>'/\A(?:)/su'
        ],
        [
            "is_sentence_boundary"=>false,
            "before"=>'/(?:(?:[\.\s]\p{L}{1,2}\.\s))\Z/su',
            "after"=>'/\A(?:[\p{N}\p{Ll}])/su'
        ],
        [
            "is_sentence_boundary"=>false,
            "before"=>'/(?:(?:[\[\(]*\.\.\.[\]\)]* ))\Z/su',
            "after"=>'/\A(?:[^\p{Lu}])/su'
        ],
        [
            "is_sentence_boundary"=>false,
            "before"=>'/(?:(?:\b(?:pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs)\.\s))\Z/su',
            "after"=>'/\A(?:[^\p{Lu}]|I)/su'
        ],
        [
            "is_sentence_boundary"=>false,
            "before"=>'/(?:(?:\b[Ee]tc\.\s))\Z/su',
            "after"=>'/\A(?:[^\p{Lu}])/su'
        ],
        [
            "is_sentence_boundary"=>false,
            "before"=>'/(?:(?:[\.!?…]+\p{Pe} )|(?:[\[\(]*…[\]\)]* ))\Z/su',
            "after"=>'/\A(?:\p{Ll})/su'
        ],
        [
            "is_sentence_boundary"=>false,
            "before"=>'/(?:(?:\b\p{L}\.))\Z/su',
            "after"=>'/\A(?:\p{L}\.)/su'
        ],
        [
            "is_sentence_boundary"=>false,
            "before"=>'/(?:(?:\b\p{L}\.\s))\Z/su',
            "after"=>'/\A(?:\p{L}\.\s)/su'
        ],
        [
            "is_sentence_boundary"=>false,
            "before"=>'/(?:(?:\b[Ff]igs?\.\s)|(?:\b[nN]o\.\s))\Z/su',
            "after"=>'/\A(?:\p{N})/su'
        ],
        [
            "is_sentence_boundary"=>false,
            "before"=>'/(?:(?:[\"”\']\s*))\Z/su',
            "after"=>'/\A(?:\s*\p{Ll})/su'
        ],
        [
            "is_sentence_boundary"=>true,
            "before"=>'/(?:(?:[\.!?…][\x{00BB}\x{2019}\x{201D}\x{203A}\"\'\p{Pe}\x{0002}]*\s)|(?:\r?\n))\Z/su',
            "after"=>'/\A(?:)/su'
        ],
        [
            "is_sentence_boundary"=>true,
            "before"=>'/(?:(?:[\.!?…][\'\"\x{00BB}\x{2019}\x{201D}\x{203A}\p{Pe}\x{0002}]*))\Z/su',
            "after"=>'/\A(?:\p{Lu}[^\p{Lu}])/su'
        ],
        [
            "is_sentence_boundary"=>true,
            "before"=>'/(?:(?:\s\p{L}[\.!?…]\s))\Z/su',
            "after"=>'/\A(?:\p{Lu}\p{Ll})/su'
        ]
    );

    $sentences = array();
    $sentence = '';
    $before = '';
    $testLen = 10; // Used to set before/after chunk sizes. 10 seems to be the smallest that works the best.
    $after = substr($text, 0, $testLen); // start with the first set of chars.

    while($text != '') {
        // run regex tests
        foreach($regexes as $reg) {
            if(preg_match($reg["before"], $before) && preg_match($reg["after"], $after)) {
                // if this passes a sentence ending test then add to the array
                if($reg["is_sentence_boundary"]) {
                    $sentences[] = $sentence;
                    $sentence = '';
                }
                break;
            }
        }

        // add the char to the sentence
        $sentence .= $after[0];

        // eat at text until empty to end loop
        $text = substr($text, 1);

        // add a char behind the before var and then remove the first char
        $before = substr($before.$after[0], -$testLen);

        // create a new after with the first chars from the text
        $after = substr($text, 0, $testLen);

    }

    if($sentence != '') {
        $sentences[] = $sentence . $after;
    }
    return $sentences;
}
$text = "Mr. Entertainment media properties. Fairy Tail 3.5 and Tokyo Ghoul.";
print_r(sentence_split($text));

Upvotes: 2

ndnenkov

Reputation: 36110

As should be expected, natural language processing of any sort is not a trivial task. The reason is that natural languages are evolutionary systems: no single person sat down and decided which ideas were good and which were not. Every rule has 20-40% exceptions. With that said, the complexity of a single regex that could do your bidding would be off the charts. Still, the following solution relies mainly on regexes.


  • The idea is to gradually go over the text.
  • At any given time, the current chunk of the text will be contained in two different parts. One, which is the candidate for a substring before a sentence boundary and another - after.
  • The first 10 regex pairs detect positions which look like sentence boundaries, but actually aren't. In that case, before and after are advanced without registering a new sentence.
  • If none of these pairs matches, matching will be attempted with the last 3 pairs, possibly detecting a boundary.

As for where these regexes came from: I translated this Ruby library, which is generated based on this paper. If you truly want to understand them, there is no alternative but to read the paper.

As far as accuracy goes - I encourage you to test it with different texts. After some experimentation, I was very pleasantly surprised.

In terms of performance, the regexes should be highly performant: all of them have either a \A or \Z anchor, there are almost no repetition quantifiers, and where there are, no backtracking is possible. Still, regexes are regexes. You will have to do some benchmarking if you plan to use this in tight loops on huge chunks of text.


Mandatory disclaimer: excuse my rusty PHP skills. The following code might not be the most idiomatic PHP ever, but it should still be clear enough to get the point across.


function sentence_split($text) {
    $before_regexes = array('/(?:(?:[\'\"„][\.!?…][\'\"”]\s)|(?:[^\.]\s[A-Z]\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s[A-Z]\.\s)|(?:\bApr\.\s)|(?:\bAug\.\s)|(?:\bBros\.\s)|(?:\bCo\.\s)|(?:\bCorp\.\s)|(?:\bDec\.\s)|(?:\bDist\.\s)|(?:\bFeb\.\s)|(?:\bInc\.\s)|(?:\bJan\.\s)|(?:\bJul\.\s)|(?:\bJun\.\s)|(?:\bMar\.\s)|(?:\bNov\.\s)|(?:\bOct\.\s)|(?:\bPh\.?D\.\s)|(?:\bSept?\.\s)|(?:\b\p{Lu}\.\p{Lu}\.\s)|(?:\b\p{Lu}\.\s\p{Lu}\.\s)|(?:\bcf\.\s)|(?:\be\.g\.\s)|(?:\besp\.\s)|(?:\bet\b\s\bal\.\s)|(?:\bvs\.\s)|(?:\p{Ps}[!?]+\p{Pe} ))\Z/su',
        '/(?:(?:[\.\s]\p{L}{1,2}\.\s))\Z/su',
        '/(?:(?:[\[\(]*\.\.\.[\]\)]* ))\Z/su',
        '/(?:(?:\b(?:pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs)\.\s))\Z/su',
        '/(?:(?:\b[Ee]tc\.\s))\Z/su',
        '/(?:(?:[\.!?…]+\p{Pe} )|(?:[\[\(]*…[\]\)]* ))\Z/su',
        '/(?:(?:\b\p{L}\.))\Z/su',
        '/(?:(?:\b\p{L}\.\s))\Z/su',
        '/(?:(?:\b[Ff]igs?\.\s)|(?:\b[nN]o\.\s))\Z/su',
        '/(?:(?:[\"”\']\s*))\Z/su',
        '/(?:(?:[\.!?…][\x{00BB}\x{2019}\x{201D}\x{203A}\"\'\p{Pe}\x{0002}]*\s)|(?:\r?\n))\Z/su',
        '/(?:(?:[\.!?…][\'\"\x{00BB}\x{2019}\x{201D}\x{203A}\p{Pe}\x{0002}]*))\Z/su',
        '/(?:(?:\s\p{L}[\.!?…]\s))\Z/su');
    $after_regexes = array('/\A(?:)/su',
        '/\A(?:[\p{N}\p{Ll}])/su',
        '/\A(?:[^\p{Lu}])/su',
        '/\A(?:[^\p{Lu}]|I)/su',
        '/\A(?:[^\p{Lu}])/su',
        '/\A(?:\p{Ll})/su',
        '/\A(?:\p{L}\.)/su',
        '/\A(?:\p{L}\.\s)/su',
        '/\A(?:\p{N})/su',
        '/\A(?:\s*\p{Ll})/su',
        '/\A(?:)/su',
        '/\A(?:\p{Lu}[^\p{Lu}])/su',
        '/\A(?:\p{Lu}\p{Ll})/su');
    $is_sentence_boundary = array(false, false, false, false, false, false, false, false, false, false, true, true, true);
    $count = 13;

    $sentences = array();
    $sentence = '';
    $before = '';
    $after = substr($text, 0, 10);
    $text = substr($text, 10);

    while($text != '') {
        for($i = 0; $i < $count; $i++) {
            if(preg_match($before_regexes[$i], $before) && preg_match($after_regexes[$i], $after)) {
                if($is_sentence_boundary[$i]) {
                    array_push($sentences, $sentence);
                    $sentence = '';
                }
                break;
            }
        }

        $first_from_text = $text[0];
        $text = substr($text, 1);
        $first_from_after = $after[0];
        $after = substr($after, 1);
        $before .= $first_from_after;
        $sentence .= $first_from_after;
        $after .= $first_from_text;
    }

    if($sentence != '' && $after != '') {
        array_push($sentences, $sentence.$after);
    }

    return $sentences;
}

$text = "Mr. Entertainment media properties. Fairy Tail 3.5 and Tokyo Ghoul.";
print_r(sentence_split($text));
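
One caveat with the loop above for non-ASCII input: substr() works on bytes, so the 10-byte before/after windows can begin or end in the middle of a multibyte UTF-8 character. preg_match() with the u modifier returns false on such invalid input, so a rule may silently fail to fire. Below is a minimal sketch of the same before/after idea that moves over whole characters instead. The function name sentence_split_chars() and the two-rule list are assumptions made for illustration; they are not the full rule set from the answer.

```php
<?php
// Hypothetical helper (not from the answer): same before/after windows,
// but sliding over UTF-8 characters rather than bytes.
function sentence_split_chars(string $text, array $rules, int $window = 10): array
{
    // Split into an array of whole UTF-8 characters.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return [$text]; // input is not valid UTF-8; give it back untouched
    }

    $n = count($chars);
    $sentences = [];
    $sentence = '';

    for ($i = 0; $i < $n; $i++) {
        // Up to $window characters on each side of position $i.
        $start  = max(0, $i - $window);
        $before = implode('', array_slice($chars, $start, $i - $start));
        $after  = implode('', array_slice($chars, $i, $window));

        foreach ($rules as [$isBoundary, $beforeRe, $afterRe]) {
            if (preg_match($beforeRe, $before) && preg_match($afterRe, $after)) {
                if ($isBoundary && $sentence !== '') {
                    $sentences[] = $sentence;
                    $sentence = '';
                }
                break; // the first matching rule pair wins
            }
        }

        $sentence .= $chars[$i];
    }

    if ($sentence !== '') {
        $sentences[] = $sentence;
    }
    return $sentences;
}

// Two illustrative rule pairs only: [is_boundary, before, after].
$rules = [
    [false, '/\b(?:Mr|Mrs|Dr)\.\s\z/su', '/\A/su'], // abbreviation, no boundary
    [true,  '/[.!?…]\s\z/su',            '/\A/su'], // punctuation + whitespace
];

$text = "Mr. Müller déjà vu. Ça va… Émile répond.";
print_r(sentence_split_chars($text, $rules));
```

The full rule list from the answer could be passed in the same [is_boundary, before, after] shape; only the windowing changes.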

Upvotes: 13

Puneet Singh

Reputation: 3543

Henrik Petterson, please read this completely, because I need to repeat a few things that have already been said above.

As many people have mentioned above, adding the /u modifier makes the regex work on Unicode characters. That is true, and it works perfectly in the example below:

http://ideone.com/750lMn

<?php


    function splitSentences($text) {
        $re = '/# Split sentences on whitespace between them.
            (?<=                # Begin positive lookbehind.
              [.!?]             # Either an end of sentence punct,
            | [.!?][\'"]        # or end of sentence punct and quote.
            )                   # End positive lookbehind.
            (?<!                # Begin negative lookbehind.
              Mr\.              # Skip either "Mr."
            | Mrs\.             # or "Mrs.",
            | Ms\.              # or "Ms.",
            | Jr\.              # or "Jr.",
            | Dr\.              # or "Dr.",
            | Prof\.            # or "Prof.",
            | Vol\.             # or "Vol.",
            | A\.D\.            # or "A.D.",
            | B\.C\.            # or "B.C.",
            | Sr\.              # or "Sr.",
            | T\.V\.A\.         # or "T.V.A.",
                                # or... (you get the idea).
            )                   # End negative lookbehind.
            \s+                 # Split on whitespace between sentences.
            /uix';

        $sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
        return $sentences;
    }

$sentences = 'Entertainment media properties. Ã Fairy Tail and Tokyo Ghoul. Entertainment media properties. &Acirc;&nbsp; Fairy Tail and Tokyo Ghoul.';

$sentences = splitSentences($sentences);

print_r($sentences);

The examples you gave in the comments were not working because they have no whitespace characters between the two sentences, and your regex specifically requires whitespace between sentences:

\s+                 # Split on whitespace between sentences.

The example below, taken from your comments above, fails simply because there is no space before Â.

http://ideone.com/m164fp

Upvotes: 3

Artyom

Reputation: 31273

There is a quite complex Unicode Text Segmentation algorithm that deals with various text boundaries, including sentence boundaries.

http://unicode.org/reports/tr29/

The best-known implementation of this algorithm is ICU.

I have found this class: http://php.net/manual/en/class.intlbreakiterator.php; however, it seems to be in git only, not in a mainstream release.

So if you want to solve this VERY complex problem in the best way, I'd suggest that you:

  • Get this class from somewhere
  • Write a small PHP plugin that wraps ICU functionality you need - it is actually quite simple as long as you build specific functionality.
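
For reference, a minimal sketch of sentence splitting with IntlBreakIterator, assuming a PHP build where the intl extension provides the class (the manual lists it for PHP 5.5 and later; the "\u{...}" string syntax used here needs PHP 7). The function name icu_split_sentences() and the en_US locale are assumptions made for this example. ICU applies the generic UAX #29 rules, so abbreviations such as "Mr." may still produce unwanted breaks.

```php
<?php
// Hypothetical wrapper around ICU's sentence break iterator.
// Requires the intl extension.
function icu_split_sentences(string $text, string $locale = 'en_US'): array
{
    $it = IntlBreakIterator::createSentenceInstance($locale);
    $it->setText($text);

    $sentences = [];
    // getPartsIterator() yields the substrings between consecutive boundaries.
    foreach ($it->getPartsIterator() as $part) {
        // Unicode-aware trim, so a trailing non-breaking space is removed too.
        $part = preg_replace('/^\s+|\s+$/u', '', $part);
        if ($part !== null && $part !== '') {
            $sentences[] = $part;
        }
    }
    return $sentences;
}

if (extension_loaded('intl')) {
    // U+00A0 (non-breaking space) between the two sentences.
    $text = "Entertainment media properties.\u{00A0}Fairy Tail and Tokyo Ghoul.";
    print_r(icu_split_sentences($text));
}
```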

Upvotes: 1

Wiktor Stribiżew

Reputation: 627083

I believe it is impossible to build a bullet-proof sentence splitter, considering that user-generated content is not always grammatically and syntactically correct. Moreover, 100% correct results are out of reach because scraping and content-fetching tools are imperfect and may fail to return clean content, leaving whitespace or punctuation rubbish in the text. And finally, business is now more biased towards a good-enough strategy: if you manage to split the text correctly 95% of the time, that is in most cases considered a success.

Now, any sentence splitting task is an NLP task, and just one, or two, or three regexps are not enough. Rather than devising your own regex chain, I'd advise using an existing NLP library for that.

  1. vanderlee's php-sentence (depends on reasonably grammatically correct punctuation)

The following is a rough list of the rules used to split sentences.

  • Each linebreak separates sentences.
  • The end of the text indicates the end of a sentence if not otherwise ended through proper punctuation.
  • Sentences must be at least two words long, unless a linebreak or end-of-text.
  • An empty line is not a sentence.
  • Each question- or exclamation mark or combination thereof, is considered the end of a sentence.
  • A single period is considered the end of a sentence, unless...
    • It is preceded by one word, or...
    • It is followed by one word.
  • A sequence of multiple periods is not considered the end of a sentence.

Usage example:

<?php
    require_once 'classes/autoloader.php'; // Include the autoloader.
    $text   = "Hello there, Mr. Smith. What're you doing today... Smith,"
            . " my friend?\n\nI hope it's good. This last sentence will"
            . " cost you $2.50! Just kidding :)"; // This is the test text we're going to use
    $Sentence   = new Sentence;   // Create a new instance
    $sentences  = $Sentence->split($text); // Split into array of sentences
    $count      = $Sentence->count($text); // Count the number of sentences
?>
  2. NlpTools is another library you might use for this task. Here is sample code implementing a naive rule-based sentence tokenizer:

Sample code:

<?php
include ('vendor/autoload.php');
 
use \NlpTools\Tokenizers\ClassifierBasedTokenizer;
use \NlpTools\Tokenizers\WhitespaceTokenizer;
use \NlpTools\Classifiers\ClassifierInterface;
use \NlpTools\Documents\DocumentInterface;
 
class EndOfSentence implements ClassifierInterface
{
    public function classify(array $classes, DocumentInterface $d) {
        list($token,$before,$after) = $d->getDocumentData();
 
        $dotcnt = count(explode('.',$token))-1;
        $lastdot = substr($token,-1)=='.';
 
        if (!$lastdot) // assume that all sentences end in full stops
            return 'O';
 
        if ($dotcnt>1) // to catch some naive abbreviations U.S.A.
            return 'O';
 
        return 'EOW';
    }
}
$tok = new ClassifierBasedTokenizer(
    new EndOfSentence(),
    new WhitespaceTokenizer()
);
$text = "We are what we repeatedly do.
        Excellence, then, is not an act, but a habit.";
 
print_r($tok->tokenize($text));
 
// Array
// (
//    [0] => We are what we repeatedly do.
//    [1] => Excellence, then, is not an act, but a habit.
// )
 
  3. You can get a PHP/Java bridge to use Java StanfordNLP (here is a Java example of splitting text into sentences).

IMPORTANT NOTE: Most NLP tokenization models I tested do not handle glued sentences well. However, if you add a space after each punctuation chain, sentence splitting quality improves. Just run this before sending the text to the sentence splitting function:

$txt = preg_replace('~\p{P}+~u', "$0 ", $txt);
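
To make the effect concrete, here is a small sketch of that pre-processing step on a glued input, run with the u modifier so that \p{P} sees whole UTF-8 characters. The second pattern is a hypothetical narrower variant, not from any of the libraries above: it assumes that only sentence-ending punctuation directly followed by an uppercase letter should get a space, which leaves decimals such as 3.5 alone.

```php
<?php
$txt = "Entertainment media properties.Fairy Tail 3.5 and Tokyo Ghoul!Done.";

// Space after every punctuation run.
$all = preg_replace('~\p{P}+~u', '$0 ', $txt);

// Assumed narrower variant: only . ! ? … runs that are directly
// followed by an uppercase letter get a space appended.
$narrow = preg_replace('~[.!?…]+(?=\p{Lu})~u', '$0 ', $txt);

echo $all, "\n";    // note "3. 5" and the trailing space
echo $narrow, "\n"; // "3.5" is left untouched
```

The broad version is simpler but also splits decimals and appends a trailing space; the narrow one trades that for missing sentences that start with a lowercase letter or a digit.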

Upvotes: 2

Arnold Daniels

Reputation: 16573

If spaces are unreliable, then you could match on a . followed by any number of spaces, followed by a capital letter.

You can match any uppercase Unicode letter using the Unicode character property \p{Lu}.

You only need to exclude abbreviations that tend to be followed by proper names (personal names, company names, etc.), since those start with a capital letter.

function splitSentences($text) {
    $re = '/                # Split sentences ending with a dot
        .+?                 # Match everything before, until we find
        (
          $ |               # the end of the string, or
          \.                # a dot
          (?<!              #  Begin negative lookbehind.
            Mr\.            #   Skip either "Mr."
          | Mrs\.           #   or "Mrs.",
                            #   or... (you get the idea).
          )                 #   End negative lookbehind.
          "?                #   Optionally match a quote
          \s*               #   Any number of whitespaces
          (?=               #  Begin positive lookahead
            \p{Lu} |        #   an upper case letter, or
            "               #   a quote
          )
        )
        /iux';

    if (!preg_match_all($re, $text, $matches, PREG_PATTERN_ORDER)) { 
        return [];
    }

    $sentences = array_map('trim', $matches[0]);

    return $sentences;
}

$text = "Mr. Entertainment media properties. Fairy Tail 3.5 and Tokyo Ghoul.";
$sentences = splitSentences($text);

print_r($sentences);

Note: This answer might not be accurate enough for your situation. I'm unable to judge that. It does address the problem as described above and is easily understandable.

Upvotes: 3

bobince

Reputation: 536567

Â (followed by an invisible non-breaking space) is what it looks like when you print a UTF-8 character U+00A0 Non-Breaking Space to a page/console being interpreted as Latin-1. So I think you have a non-breaking space between the sentences, not a normal space.

\s can match a non-breaking space too, but you will need to use the /u modifier to tell preg you are sending it a UTF-8-encoded string. Otherwise it, like your print command, will guess Latin-1 and see it as the two characters Â and a non-breaking space.
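
A quick way to check this, as a minimal sketch: the test string is built with an explicit U+00A0 via the "\u{00A0}" escape (PHP 7 and later), and the split pattern is a simplified stand-in for the one in the question, not the full regex.

```php
<?php
// U+00A0 is encoded in UTF-8 as the two bytes 0xC2 0xA0.
echo bin2hex("\u{00A0}"), "\n"; // c2a0

$text = "Entertainment media properties.\u{00A0}Fairy Tail and Tokyo Ghoul.";

// With /u the subject is read as UTF-8, so \s can match U+00A0
// and the split happens at the non-breaking space.
$parts = preg_split('/(?<=[.!?])\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
print_r($parts);
```

Dropping the u from the pattern makes preg see the bytes 0xC2 and 0xA0 separately, which is why the original regex finds no whitespace to split on.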

Upvotes: 6
