mustafa

Reputation: 745

Finding sentences between characters

I am trying to find sentences between a pipe | and a dot ., e.g.

| This is one. This is two.

The regex pattern I use:

preg_match_all('/(:\s|\|+)(.*?)(\.|!|\?)/s', $file0, $matches);

So far I could not manage to capture both sentences. The regex I use captures only the first sentence.

How can I solve this problem?

EDIT: As can be seen from the regex, I am trying to find the sentences BETWEEN (: or |) AND (. or ! or ?).

A colon or pipe indicates the starting point for sentences. The sentences might be:

: Sentence one. Sentence two. Sentence three. 
| Sentence one. Sentence two? 
| Sentence one. Sentence two! Sentence three?

Upvotes: 0

Views: 68

Answers (4)

Booboo

Reputation: 44108

To keep it simple, find everything between a | and the last . before the next | (or the end of the input), and then split on the dots:

$input = "John loves Mary. | This is one. This is two. | Sentence 1. Sentence 2.";
preg_match_all('/\|\s*([^|]+)\./', $input, $matches);
if (!empty($matches[1])) { // $matches itself is never empty, so test the capture list
    foreach($matches[1] as $match) {
        print_r(preg_split('/\.\s*/', $match));
    }
}

Prints:

Array
(
    [0] => This is one
    [1] => This is two
)
Array
(
    [0] => Sentence 1
    [1] => Sentence 2
)
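
The same two-step idea can be stretched to the delimiters from the question's edit. This is only a sketch: the function name extractSentences is made up for illustration, and it assumes that : and | appear only as start markers, never inside a sentence.

```php
<?php
// Hypothetical helper (not part of the original answer): two-step extraction
// for runs that start at ':' or '|' and whose sentences end in '.', '!' or '?'.
function extractSentences(string $input): array
{
    $result = [];
    // Step 1: capture each run from a start marker up to the last
    // terminator that precedes the next marker (or the end of the input).
    preg_match_all('/[:|]\s*([^:|]+[.!?])/', $input, $matches);
    foreach ($matches[1] as $run) {
        // Step 2: split on terminators and drop the empty trailing piece.
        $result[] = preg_split('/[.!?]\s*/', $run, -1, PREG_SPLIT_NO_EMPTY);
    }
    return $result;
}

print_r(extractSentences("| Sentence one. Sentence two! Sentence three?"));
```

Each marked run comes back as its own array of sentences, with the terminators removed.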

Upvotes: 0

The fourth bird

Reputation: 163277

Another option is to use \G to get consecutive matches: it asserts the position at the end of the previous match. Each sentence is captured in group 1, and the pattern then matches a dot and 0+ horizontal whitespace chars after it.

(?:\|\h*|\G(?!^))([^.\r\n]+)\.\h*

In parts

  • (?: Non capturing group
    • \|\h* Match | and 0+ horizontal whitespace chars
    • | Or
    • \G(?!^) Assert position at the end of the previous match, but not at the start of the string
  • ) Close group
  • ( Capture group 1
    • [^.\r\n]+ Match 1+ times any char other than . or a newline
  • ) Close group
  • \.\h* Match 1 . and 0+ horizontal whitespace chars


For example

$re = '/(?:\|\h*|\G(?!^))([^.\r\n]+)\.\h*/';
$str = '| This is one. This is two.
John loves Mary.| This is one. This is two.';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
print_r($matches);

Output

Array
(
    [0] => Array
        (
            [0] => | This is one. 
            [1] => This is one
        )

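    [1] => Array
        (
            [0] => This is two.
            [1] => This is two
        )

    [2] => Array
        (
            [0] => | This is one. 
            [1] => This is one
        )

    [3] => Array
        (
            [0] => This is two.
            [1] => This is two
        )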

)
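
To cover the delimiters from the question's edit, the same \G approach might be adapted as below. This is a sketch rather than a tested drop-in: the character classes [:|] and [.!?] are my assumption about what should count as a start marker and a terminator, and the sample string is made up.

```php
<?php
// Assumed variant of the pattern above: ':' or '|' starts a run,
// and '.', '!' or '?' ends each sentence.
$re = '/(?:[:|]\h*|\G(?!^))([^.!?\r\n]+)[.!?]\h*/';
$str = "| Sentence one. Sentence two! Sentence three?\nJohn loves Mary.";

preg_match_all($re, $str, $matches);
// $matches[1] holds only the captured sentence text (group 1).
print_r($matches[1]);
```

The line "John loves Mary." is not matched, because it has no start marker and the \G chain cannot continue across the newline.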

Upvotes: 1

Toto

Reputation: 91385

This does the job:

$str = '| This is one. This is two.';
preg_match_all('/(?:\s|\|)+(.*?)(?=[.!?])/', $str, $m);
print_r($m);

Output:

Array
(
    [0] => Array
        (
            [0] => | This is one
            [1] =>  This is two
        )

    [1] => Array
        (
            [0] => This is one
            [1] => This is two
        )

)
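
One behaviour worth checking before relying on this pattern: the leading group accepts plain whitespace as well as a pipe, so text that is not preceded by a pipe can match too. A small sketch with a made-up input:

```php
<?php
// Made-up input: the first sentence is not introduced by a pipe.
$str = 'John loves Mary. | This is one.';
preg_match_all('/(?:\s|\|)+(.*?)(?=[.!?])/', $str, $m);
// Group 1 also picks up "loves Mary", because the space after "John"
// satisfies the (?:\s|\|)+ start group.
print_r($m[1]);
```

If only pipe-introduced sentences are wanted, the start group would need to require the pipe rather than allow whitespace alone.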


Upvotes: 1

Tim Biegeleisen

Reputation: 521093

I would keep it simple and just match on:

\s*[^.|]+\s*

This matches any run of characters that are neither pipes nor full stops. The \s* on either side only absorbs adjacent whitespace into the match; it does not strip it, so each result below keeps its leading space.

$input = "| This is one. This is two.";
preg_match_all('/\s*[^.|]+\s*/s', $input, $matches);
print_r($matches[0]);

This prints:

Array
(
    [0] =>  This is one
    [1] =>  This is two
)
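
If the leading space is unwanted, the matches can be cleaned up afterwards. A minimal sketch, assuming plain trim() is enough for the input at hand (the variable name $sentences is just for illustration):

```php
<?php
$input = "| This is one. This is two.";
preg_match_all('/\s*[^.|]+\s*/s', $input, $matches);

// Strip the whitespace that \s* pulled into each match, then drop any
// match that turns out to be whitespace only and reindex the array.
$sentences = array_values(array_filter(array_map('trim', $matches[0]), 'strlen'));
print_r($sentences);
```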

Upvotes: 1
