Håkon Hægland

Reputation: 40748

Non-greedy regular expression match for multicharacter delimiters in awk

Consider the string "AB 1 BA 2 AB 3 BA". How can I match the content between "AB" and "BA" in a non-greedy fashion (in awk)?

I have tried the following:

awk '
BEGIN {
    str="AB 1 BA 2 AB 3 BA"
    regex="AB([^B][^A]|B[^A]|[^B]A)*BA"
    if (match(str,regex))
        print substr(str,RSTART,RLENGTH)
}'

with no output. I believe the match fails because there is an odd number of characters between "AB" and "BA": every alternative in the group consumes exactly two characters. If I replace str with "AB 11 BA 22 AB 33 BA", the regex seems to work.
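For comparison, here is a small sketch that runs the same regex on both strings side by side. The array name `tests` and the ";" separator are only there for the illustration.

```shell
# Run the regex from the question on both test strings.
pairs=$(awk 'BEGIN {
    regex = "AB([^B][^A]|B[^A]|[^B]A)*BA"
    n = split("AB 1 BA 2 AB 3 BA;AB 11 BA 22 AB 33 BA", tests, ";")
    for (i = 1; i <= n; i++) {
        if (match(tests[i], regex))
            print substr(tests[i], RSTART, RLENGTH)
        else
            print "no match"
    }
}')
echo "$pairs"
```

The first string (one character between the spaces) gives no match, while the second yields "AB 11 BA".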

Upvotes: 5

Views: 4870

Answers (4)

Duane

Reputation: 21

Mark the field with special text (backspace or \x00 or similar), then include the field splitter in the gsub command to remove the extra field splitter.

$1="\b";
gsub("\b"FS,"")

Upvotes: 1

ericbn

Reputation: 10958

For general expressions, I'm using this as a non-greedy match:

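function smatch(s, r,    m, n) {    # m and n are local variables
    if (match(s, r)) {
        m = RSTART
        # Shrink the candidate while a shorter match still starts at position m
        do {
            n = RLENGTH
        } while (n > 0 && match(substr(s, m, n - 1), r) && RSTART == 1)
        RSTART = m
        RLENGTH = n
        return RSTART
    } else return 0
}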

smatch behaves like match, returning:

the position in s where the regular expression r occurs, or 0 if it does not. The variables RSTART and RLENGTH are set to the position and length of the matched string.
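A minimal usage sketch on the string from the question, assuming the greedy pattern "AB.*BA" as input. The function is repeated, with m and n kept local and the loop guarded against empty and shifted matches, so that the snippet runs on its own.

```shell
# smatch shrinks the greedy match until no shorter one starts at the same position.
lazy=$(awk '
function smatch(s, r,    m, n) {
    if (match(s, r)) {
        m = RSTART
        do {
            n = RLENGTH
        } while (n > 0 && match(substr(s, m, n - 1), r) && RSTART == 1)
        RSTART = m
        RLENGTH = n
        return RSTART
    } else return 0
}
BEGIN {
    str = "AB 1 BA 2 AB 3 BA"
    if (smatch(str, "AB.*BA"))
        print substr(str, RSTART, RLENGTH)
}')
echo "$lazy"
```

Here the plain match covers the whole string, and smatch narrows it down to "AB 1 BA". Note that each shrinking step calls match again, so this can get slow on long matches.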

Upvotes: 1

hmijail

Reputation: 1141

The other answer didn't really answer: how to match non-greedily? Looks like it can't be done in (G)AWK. The manual says this:

awk (and POSIX) regular expressions always match the leftmost, longest sequence of input characters that can match.

https://www.gnu.org/software/gawk/manual/gawk.html#Leftmost-Longest

And the whole manual contains neither the word "greedy" nor "lazy". It mentions Extended Regular Expressions, but for non-greedy matching you'd need Perl-Compatible Regular Expressions. So… no, can't be done.
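To illustrate the leftmost-longest rule, here is a small sketch with the string from the question and the obvious pattern "AB.*BA" (the variable name `greedy` is just for this example):

```shell
# ".*" runs up to the LAST "BA", because awk picks the longest match.
greedy=$(awk 'BEGIN {
    str = "AB 1 BA 2 AB 3 BA"
    if (match(str, /AB.*BA/))
        print substr(str, RSTART, RLENGTH)
}')
echo "$greedy"
```

This prints the entire string "AB 1 BA 2 AB 3 BA" rather than stopping at the first "BA".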

Upvotes: 5

Tim Pietzcker

Reputation: 336148

Merge your two negated character classes and remove the [^A] from the second alternation:

regex = "AB([^AB]|B|[^B]A)*BA"

This regex fails on the string ABABA, though - not sure if that is a problem.

Explanation:

AB       # Match AB
(        # Group 1 (could also be non-capturing)
 [^AB]   # Match any character except A or B
|        # or
 B       # Match B
|        # or
 [^B]A   # Match any character except B, then A
)*       # Repeat as needed
BA       # Match BA

Since the only way to match an A in the alternation is by matching a character except B before it, we can safely use the simple B as one of the alternatives.
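A quick sketch to check this regex against the two strings from the question and against the ABABA case mentioned above (the array name `tests` and the ";" separator are only for the illustration):

```shell
# Try the merged regex on three inputs and print the match or "no match".
out=$(awk 'BEGIN {
    regex = "AB([^AB]|B|[^B]A)*BA"
    n = split("AB 1 BA 2 AB 3 BA;AB 11 BA 22 AB 33 BA;ABABA", tests, ";")
    for (i = 1; i <= n; i++) {
        if (match(tests[i], regex))
            print substr(tests[i], RSTART, RLENGTH)
        else
            print "no match"
    }
}')
echo "$out"
```

The first two inputs give "AB 1 BA" and "AB 11 BA"; ABABA gives no match, because the lone A between the delimiters cannot be consumed by any alternative.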

Upvotes: 5
