DudeOnRock

Reputation: 3831

Matching *consecutive* lines that begin with an arbitrary amount of whitespace followed by a character

I am trying to match consecutive lines that start with an arbitrary amount of whitespace followed by the character |. I am using the s flag, so that . matches newlines.

What I have so far works only with a fixed, limited amount of whitespace before |.

I am having issues with the part that detects a line that does not meet the requirements. For some reason \n\s*[^\|] does not do the trick. What I am doing right now is the following:

(?P<terminating>
    \n(             # when newline is encountered...
        [^\|\s]         #   check if next character is not: (| or space)
        |
        [\s][^\|\s]     #   check if next characters are not: space + (| or space)
        |
        [\s][\s][^\|\s] #   check if next characters are not: space + space + (| or space)... And so on....
    )
    |
    $
)

This obviously only works for two spaces. I would like to make this work for an arbitrary amount of spaces. I looked into recursion, but it seems like that is quite the heavy gun to wield in this case. Here now is my question: Why does \n\s*[^\|] not work, and is there another way of solving this without recursion?


Below is an example of input and the resulting match I would like to get:

Input string:

Lorem ipsum dolor sit amet, 
consectetur adipisicing 
elit, 
|sed do 
        |eiusmod tempor incididunt 
     |ut labore et dolore magna aliqua.
Ut enim ad minim veniam, 
quis nostrud exercitation 
ullamco laboris nisi ut 
aliquip ex ea commodo consequat.

Output is one string with content:

|sed do\n        |eiusmod tempor incididunt\n     |ut labore et dolore magna aliqua.

I don't want three separate matches, one for each line that contains |.

Upvotes: 0

Views: 272

Answers (4)

Kamleein

Reputation: 11

For those who use Perl, you may use the code below. I am sure it can be improved, and I would be happy to learn if someone could help me enhance it.

my $Str = "Lorem ipsum dolor sit amet,
consectetur adipisicing
elit,
|sed do
        |eiusmod tempor incididunt
     |ut labore et dolore magna aliqua.
Ut enim ad minim veniam,
quis nostrud exercitation
ullamco laboris nisi ut
aliquip ex ea commodo consequat.";
my @lLine = split(/\n/, $Str);
my $ReturnStr = '';
foreach my $lLine (@lLine) {
    # keep only lines whose first non-whitespace character is |
    if ($lLine =~ /^\s*\|/) {
        $ReturnStr .= $lLine;
    }
}

The output was: |sed do |eiusmod tempor incididunt |ut labore et dolore magna aliqua.
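A line-by-line approach like this one collects every | line into a single string, even when the lines are not adjacent. As a rough sketch of how consecutive runs could be kept apart, here is the same idea in Python. The function name pipe_runs is made up for illustration, and "whitespace" is assumed to mean spaces and tabs only:

```python
from itertools import groupby


def pipe_runs(text):
    """Return each run of consecutive lines whose first non-blank character is '|'."""
    lines = text.split("\n")
    runs = []
    # groupby bundles adjacent lines that share the same True/False key,
    # so every unbroken stretch of "|" lines becomes one group.
    for is_pipe, group in groupby(lines, key=lambda ln: ln.lstrip(" \t").startswith("|")):
        if is_pipe:
            runs.append("\n".join(group))
    return runs


runs = pipe_runs("a\n|b\n  |c\nd\n|e")
print(runs)
```

Each element of the returned list is one block, with the original newlines and indentation between its lines preserved.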

Upvotes: 0

Alan Moore

Reputation: 75252

If you're using PHP, this should do it:

(?m)^\h*\|.*(?:\R\h*\|.*)*

Some points of interest:

  • \h matches horizontal whitespace, meaning space and tab characters

  • \R matches a line separator, whether it be \n, \r\n, or \r

  • (?m) turns on multiline mode, which allows ^ to match the beginning of a line

  • singleline/DOTALL mode is not set, because we want the .* to stop at the end of the line.

  • I never use \s because it matches any whitespace character, including space, tab, carriage return (\r) and linefeed (\n). If you just want to find a match that might span multiple lines, it's okay to use \s or . in singleline mode. But this task involves matching things based on their position relative to the beginning of the line. That's much easier to do if you match the different kinds of whitespace character explicitly.

If you're using Python the \h and \R shorthands won't work, so you'll have to be more verbose:

(?m)^[ \t]*\|.*(?:[\r\n]+[ \t]*\|.*)*

Note that [\r\n]+ will also match empty lines; if you want to make sure there's exactly one line separator between lines, use this instead:

(?m)^[ \t]*\|.*(?:(?:\r\n|[\r\n])[ \t]*\|.*)*
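To see the stricter Python variant in action, here is a small sketch that runs it against a shortened copy of the sample input from the question. The variable names sample, pipe_block and blocks are only illustrative:

```python
import re

# Shortened copy of the question's sample text (trailing spaces kept).
sample = (
    "Lorem ipsum dolor sit amet, \n"
    "consectetur adipisicing \n"
    "elit, \n"
    "|sed do \n"
    "        |eiusmod tempor incididunt \n"
    "     |ut labore et dolore magna aliqua.\n"
    "Ut enim ad minim veniam, \n"
    "quis nostrud exercitation \n"
)

# One line separator is required between consecutive "|" lines.
pipe_block = re.compile(r"(?m)^[ \t]*\|.*(?:(?:\r\n|[\r\n])[ \t]*\|.*)*")

# Each match is one whole run of adjacent "|" lines.
blocks = [m.group() for m in pipe_block.finditer(sample)]
print(len(blocks))
print(repr(blocks[0]))
```

The three | lines come back as a single match, and the line separator after the last | line is not part of it.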

Upvotes: 2

DudeOnRock

Reputation: 3831

I solved it myself. I guess I have to exclude the space from the character group I am excluding:

\n\s*[^\|\s]

I am not quite sure why this is, though; I stumbled upon it by sheer accident. I would be grateful if someone could explain the reasoning behind this.
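The likely explanation is backtracking: \s* is allowed to give back whitespace it has already consumed, and a space is itself "not a |", so [^\|] can match the last space of the indentation. The sketch below uses Python's re module and a made-up three-line string to show the difference. PCRE should behave the same way for these two patterns, but that is an assumption and was not checked here:

```python
import re

text = "|a\n   |b\nend"

# \s* first grabs all three spaces, then [^|] fails on "|".
# The engine backtracks: \s* gives one space back, and [^|]
# accepts that space, so the "terminator" matches a valid line.
loose = re.search(r"\n\s*[^|]", text)

# With whitespace excluded from the class, giving spaces back does not help:
# the character after the whitespace run must be neither space nor pipe.
strict = re.search(r"\n\s*[^|\s]", text)

print(repr(loose.group()))   # newline plus the three spaces before "|b"
print(repr(strict.group()))  # newline plus the "e" of "end"
```

So the loose pattern reports the indented |b line as a terminating line, while the strict one skips it and only fires on the line starting with "end".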

The full expression now is as follows:

'/
    (?:
        (^|\n)\s*\|
    )
    (?P<main>
        .*?
    )
    (?=
        \n\s*[^\|\s]
        |
        $
    )
/sx'

Upvotes: 0

Casimir et Hippolyte

Reputation: 89639

You can try this pattern without the s modifier:

(?:(?:^|(?<=\n))[^\S\r\n]*\|.*(?:\r?\n|$)?)+
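As a quick check, this pattern also appears to run unchanged in Python's re module. The sketch below applies it to a made-up string. Note that each matched line takes its line break with it, so the result ends with a newline:

```python
import re

# Hypothetical input: three adjacent "|" lines with mixed indentation.
text = "intro\n|one\n\t |two\n  |three\noutro\n"

# No flags: ^ only matches at the very start, and the lookbehind (?<=\n)
# handles every later line start. [^\S\r\n] is "whitespace except CR/LF".
pattern = re.compile(r"(?:(?:^|(?<=\n))[^\S\r\n]*\|.*(?:\r?\n|$)?)+")

match = pattern.search(text)
block = match.group()
print(repr(block))
```

If the trailing line break is unwanted, stripping it afterwards (for example with block.rstrip("\r\n")) is one option.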

Upvotes: 1
