Henrik Petterson

Reputation: 7094

If a line has fewer than x words, then strip it

I have the following text in $text:

$text = 'Hello world, lorem ipsum.

What?

Hello world, lorem ipsum what.

Excuse me!';

If a line has fewer than 3 words, I want to remove that line completely. So the lines with What? and Excuse me! should be removed from the string.

Is there a regex approach, or how else should I go about this?

Upvotes: 3

Views: 74

Answers (3)

anubhava

Reputation: 785316

You can use this negative lookahead regex:

preg_replace('/^(?!(?:\h*\S+\h+){2}\S+).*\R*/m', '', $text);

Output:

Hello world, lorem ipsum.
Hello world, lorem ipsum what.

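The negative lookahead (?!(?:\h*\S+\h+){2}\S+) succeeds only on a line that does not contain at least 3 whitespace-separated words, so .* then matches that whole line. \R matches any line-break sequence (such as \n or \r\n) in PHP regex, and \R* also removes the blank lines that follow a stripped line.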
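Since the question title asks about "x words", here is a minimal sketch of how the same lookahead could be built for an arbitrary threshold. The function name stripShortLines and its $minWords parameter are hypothetical, not part of the answer above.

```php
<?php
// Hypothetical helper: remove every line that has fewer than $minWords
// whitespace-separated words, using the negative-lookahead approach.
function stripShortLines(string $text, int $minWords): string
{
    if ($minWords < 1) {
        return $text; // every line has at least 0 words, nothing to strip
    }
    // ($minWords - 1) groups of "word + horizontal blanks", then one more word.
    $pattern = '/^(?!\h*(?:\S+\h+){' . ($minWords - 1) . '}\S+).*\R*/m';
    return preg_replace($pattern, '', $text);
}

$text = "Hello world, lorem ipsum.\n\nWhat?\n\nHello world, lorem ipsum what.\n\nExcuse me!";
echo stripShortLines($text, 3);
```

With $minWords set to 3 this should behave like the regex above; note that the result keeps the line break after the last surviving line.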

Without a lookahead, you can use preg_grep:

echo implode("\n", preg_grep('/^\h*(?:\S+\h+){2}\S+/', explode("\n", $text)));

Output:

Hello world, lorem ipsum.
Hello world, lorem ipsum what.

Upvotes: 2

Lando

Reputation: 417

I came up with this. I prefer to avoid regex where possible, since it can be slower than plain string functions.

$str = 'Hello world, lorem ipsum.

What?

Hello world, lorem ipsum what.';

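$new_str = explode("\n", $str);

// Iterate by reference so the trimmed value is written back into the array.
foreach ($new_str as $keys => &$lines) {
    $lines = trim($lines);
    // Fewer than 2 spaces means fewer than 3 words
    // (assumes words are separated by single spaces).
    if (substr_count($lines, " ") < 2) {
         unset($new_str[$keys]);
    }
}
unset($lines); // break the reference left over from the loop

$new_str = implode("\n", $new_str);
print_r($new_str);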

Which prints out this:

Hello world, lorem ipsum.
Hello world, lorem ipsum what.
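Counting spaces only works when words are separated by exactly one space. As a rough sketch of a regex-free variant that counts actual words instead, the helper names countWords and stripShortLinesNoRegex below are made up for illustration:

```php
<?php
// Hypothetical helper: count space- or tab-separated words without regex.
function countWords(string $line): int
{
    // Treat tabs as spaces, split on single spaces, then drop the empty
    // pieces that runs of consecutive spaces produce.
    $pieces = explode(' ', str_replace("\t", ' ', trim($line)));
    return count(array_filter($pieces, 'strlen'));
}

// Hypothetical helper: keep only the (trimmed) lines with enough words.
function stripShortLinesNoRegex(string $text, int $minWords = 3): string
{
    $kept = [];
    foreach (explode("\n", $text) as $line) {
        if (countWords($line) >= $minWords) {
            $kept[] = trim($line);
        }
    }
    return implode("\n", $kept);
}

echo stripShortLinesNoRegex("Hello world, lorem ipsum.\n\nWhat?\n\nHello    there\n\nHello world, lorem ipsum what.");
```

Here the line "Hello    there" contains four spaces but only two words, so it is dropped, whereas a pure space count would have kept it.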

Upvotes: 3

trincot

Reputation: 350365

You could use this regular expression in preg_replace:

$test = preg_replace('/^(?!\h*\S+\h+\S+\h+\S+).*$\R?/m', '', $text);

Testing with input that touches on some additional boundary conditions:

$text = 'Hello world, lorem ipsum.
What? ending-spaces   
    Hello world, lorem
  Hello world, lorem ipsum what.
ending text';

$test = preg_replace('/^(?!\h*\S+\h+\S+\h+\S+).*$\R?/m', '', $text);

echo $test;

Output:

Hello world, lorem ipsum.
    Hello world, lorem
  Hello world, lorem ipsum what.

The (?! part looks ahead to check whether, after some optional horizontal blanks (\h*), there are three words (\S+) separated by horizontal blanks (\h+). If so, the pattern does not match and the line is kept. In all other cases .*$ matches everything up to the end of the line, and \R? also matches the line break if one is present. The whole match is replaced by an empty string, which removes that line.

The m modifier makes ^ and $ match at the beginning and end of each line, respectively (instead of only at the beginning and end of the complete string).
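To see what the m modifier changes, here is a small sketch that runs the same pattern with and without it. The sample text and the variable names are chosen only for illustration.

```php
<?php
$pattern = '^(?!\h*\S+\h+\S+\h+\S+).*$\R?';
$text = "What?\nHello world, lorem ipsum.\nExcuse me!";

// Without m: ^ and $ only anchor at the start and end of the whole string,
// so nothing matches in this multi-line input and it comes back unchanged.
$withoutM = preg_replace('/' . $pattern . '/', '', $text);

// With m: ^ and $ anchor at every line, so each short line is removed.
$withM = preg_replace('/' . $pattern . '/m', '', $text);

var_dump($withoutM === $text);
var_dump($withM);
```

In this sketch only the version with m strips the short lines, leaving the three-word line followed by its line break.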

Upvotes: 1
