Reputation: 3831
I am trying to match consecutive lines that start with an arbitrary amount of whitespace followed by the character |. I am using the s flag, so that . matches newlines.
What I have so far works with a finite amount of whitespace before |.
I am having trouble with the part that detects a line that does not meet the requirements. For some reason \n\s*[^\|] does not do the trick. What I am doing right now is the following:
(?P<terminating>
\n( # when newline is encountered...
[^\|\s] # check if next character is not: (| or space)
|
[\s][^\|\s] # check if next characters are not: space + (| or space)
|
[\s][\s][^\|\s] # check if next characters are not: space + space + (| or space)... And so on....
)
|
$
)
This obviously only works for up to two spaces. I would like to make this work for an arbitrary number of spaces. I looked into recursion, but that seems like quite the heavy gun to wield in this case. So my question is: why does \n\s*[^\|] not work, and is there another way of solving this without recursion?
Below is an example of input and the resulting match I would like to get:
Input string:
Lorem ipsum dolor sit amet,
consectetur adipisicing
elit,
|sed do
|eiusmod tempor incididunt
|ut labore et dolore magna aliqua.
Ut enim ad minim veniam,
quis nostrud exercitation
ullamco laboris nisi ut
aliquip ex ea commodo consequat.
Output is one string with content:
|sed do\n |eiusmod tempor incididunt\n |ut labore et dolore magna aliqua.
I don't want three separate matches, one for each of the lines that has | in it.
Upvotes: 0
Views: 272
Reputation: 11
For those who use Perl, you may use the code below. I am sure it can be improved, and I would be happy to learn how to enhance it.
my $Str = "Lorem ipsum dolor sit amet,
consectetur adipisicing
elit,
|sed do
|eiusmod tempor incididunt
|ut labore et dolore magna aliqua.
Ut enim ad minim veniam,
quis nostrud exercitation
ullamco laboris nisi ut
aliquip ex ea commodo consequat.";
my @lines = split /\n/, $Str;
my $ReturnStr = '';
foreach my $line (@lines) {
    # Keep only lines whose first non-whitespace character is |
    if ($line =~ /^\s*\|/) {
        $ReturnStr .= $line;
    }
}
The output was: |sed do |eiusmod tempor incididunt |ut labore et dolore magna aliqua.
Upvotes: 0
Reputation: 75252
If you're using PHP, this should do it:
(?m)^\h*\|.*(?:\R\h*\|.*)*
Some points of interest:
\h matches horizontal whitespace, meaning space and tab characters
\R matches a line separator, whether it be \n, \r\n, or \r
(?m) turns on multiline mode, which allows ^ to match the beginning of a line
singleline/DOTALL mode is not set, because we want the .* to stop at the end of the line.
I never use \s because it matches any whitespace character, including space, tab, carriage return (\r) and linefeed (\n). If you just want to find a match that might span multiple lines, it's okay to use \s or . in singleline mode. But this task involves matching things based on their position relative to the beginning of the line. That's much easier to do if you match the different kinds of whitespace character explicitly.
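To illustrate the point about \s, here is a small Python sketch (the input string is made up for this example, and the patterns are simplified to the line-start part only). It shows how \s* can swallow a line break and start the match on the wrong line, while [ \t]* cannot:

```python
import re

text = "a\n\n  |x"  # an empty line, then an indented | line (made-up input)

# \s* is allowed to swallow the line break of the empty line,
# so the match starts on the empty line and spills onto the next one
loose = re.search(r"(?m)^\s*\|", text).group()

# [ \t]* can only consume spaces and tabs, so the match stays on one line
strict = re.search(r"(?m)^[ \t]*\|", text).group()

print(repr(loose))   # '\n  |'
print(repr(strict))  # '  |'
```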
If you're using Python, the \h and \R shorthands won't work, so you'll have to be more verbose:
(?m)^[ \t]*\|.*(?:[\r\n]+[ \t]*\|.*)*
Note that [\r\n]+ will also match empty lines; if you want to make sure there's exactly one line separator between lines, use this instead:
(?m)^[ \t]*\|.*(?:(?:\r\n|[\r\n])[ \t]*\|.*)*
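As a rough sketch of how the stricter Python pattern might be used, the snippet below wraps it in a helper. The names PIPE_BLOCK and find_pipe_blocks and the sample text are made up for illustration. Each run of consecutive | lines comes back as a single string, leading whitespace included:

```python
import re

# The stricter pattern from above: exactly one line separator between | lines
PIPE_BLOCK = re.compile(r"(?m)^[ \t]*\|.*(?:(?:\r\n|[\r\n])[ \t]*\|.*)*")

def find_pipe_blocks(text):
    """Return every run of consecutive |-lines as one string per run."""
    return [m.group(0) for m in PIPE_BLOCK.finditer(text)]

sample = (
    "Lorem ipsum\n"
    "  |sed do\n"
    "  |eiusmod tempor\n"
    "Ut enim\n"
    "\t|second block\n"
    "aliquip\n"
)

blocks = find_pipe_blocks(sample)
print(blocks)  # ['  |sed do\n  |eiusmod tempor', '\t|second block']
```

If the leading whitespace of the first line is not wanted, calling lstrip(" \t") on each block should be enough.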
Upvotes: 2
Reputation: 3831
I solved it myself. I guess I have to exclude the space from the character group I am excluding:
\n\s*[^\|\s]
I'm not quite sure why this works, though; I stumbled upon it by sheer accident. I would be grateful if someone could explain the reasoning behind this.
The full expression now is as follows:
'/
(?:
(^|\n)\s*\|
)
(?P<main>
.*?
)
(?=
\n\s*[^\|\s]
|
$
)
/sx'
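On the reasoning: the behaviour comes down to backtracking. In \n\s*[^\|], the \s* first grabs all the spaces, [^\|] then fails on the |, and the engine gives one space back, which [^\|] happily accepts because a space is "not a pipe". Adding \s to the negated class closes that loophole. A minimal Python sketch with a made-up input (only the terminating part of the expression is tested here):

```python
import re

# "intro" is 5 characters, so the newlines sit at indices 5, 10 and 15
text = "intro\n  |a\n  |b\noutro"

loose = re.compile(r"\n\s*[^|]")     # original attempt
strict = re.compile(r"\n\s*[^|\s]")  # whitespace excluded as well

m_loose = loose.search(text)
m_strict = strict.search(text)

# The loose version already "terminates" in front of the first | line:
# \s* backs off by one space and [^|] matches that space instead.
print(repr(m_loose.group()), m_loose.start())    # '\n  ' 5

# The strict version only fires at the line that really has no leading |.
print(repr(m_strict.group()), m_strict.start())  # '\no' 15
```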
Upvotes: 0
Reputation: 89639
You can try this pattern without the s modifier:
(?:(?:^|(?<=\n))[^\S\r\n]*\|.*(?:\r?\n|$)?)+
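For anyone trying this pattern from Python, a quick sketch with a made-up input is shown below. One thing to be aware of is that the match includes the line break after the last | line, so it may need trimming:

```python
import re

# Same pattern as above, compiled without any flags
pattern = re.compile(r"(?:(?:^|(?<=\n))[^\S\r\n]*\|.*(?:\r?\n|$)?)+")

text = "Lorem\n  |a\n  |b\nUt enim"  # made-up input

block = pattern.search(text).group(0)
print(repr(block))    # '  |a\n  |b\n'

# Drop the trailing line break if it is not wanted
trimmed = block.rstrip("\r\n")
print(repr(trimmed))  # '  |a\n  |b'
```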
Upvotes: 1