John Howard

Reputation: 64135

Match until a character, but don't include that character

I am trying to match against inputs like:

foo=bar baz foo:1  foo:234.mds32  notfoo:baz  foo:bak foo:nospace foo:bar

and output 6 matches: everything but the notfoo. The matches should be like foo:bar (i.e. not including trailing or leading spaces).

In general, the rules I am trying to match are:

The current best regex I have for this is '(?:\s|^)(?P<primary>foo[:=].+?)\s', and then extracting the primary group.

The problem is that, because the \s is included as part of the match, we run into issues with overlapping matches: foo:bak foo:nospace foo:bar is broken because the same whitespace character would have to be matched twice (once as the end of one match and once as the start of the next), and the Go regexp package doesn't return overlapping matches.
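
To make the failure concrete, here is a minimal sketch that runs the pattern above through FindAllStringSubmatch (primaryMatches is a helper name made up for this illustration). On the sample input it should return only foo=bar, foo:1, foo:234.mds32 and foo:bak: foo:nospace is skipped because the space in front of it was already consumed by the previous match, and foo:bar is not followed by any whitespace at all.

```go
package main

import (
	"fmt"
	"regexp"
)

// primaryMatches (hypothetical helper) returns the "primary" group of every
// non-overlapping match of the pattern from the question.
func primaryMatches(input string) []string {
	r := regexp.MustCompile(`(?:\s|^)(?P<primary>foo[:=].+?)\s`)
	idx := r.SubexpIndex("primary")
	var out []string
	for _, m := range r.FindAllStringSubmatch(input, -1) {
		out = append(out, m[idx])
	}
	return out
}

func main() {
	input := `foo=bar baz foo:1  foo:234.mds32  notfoo:baz  foo:bak foo:nospace foo:bar`
	for _, s := range primaryMatches(input) {
		fmt.Printf("%q\n", s)
	}
}
```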

In other regex engines I think lookahead can be used, but as far as I can tell this is not allowed with golang regex.

Is there any way to accomplish this?

Go playground link: https://play.golang.org/p/n8gnWwpiBSR

Upvotes: 0

Views: 896

Answers (3)

TomOnTime

Reputation: 4477

Other people have given excellent answers using regular expressions as requested. Might I be so bold as to suggest a non-regex answer?

I find that regexes are not the best solution for this situation. It is better to split the string using strings.Fields(original) to get a list of substrings. For each substring, split it based on whether it contains a =, a :, or neither. The Fields() function does a great job of parsing, similar to the default split in awk, which skips multiple spaces in a row.

Working example here: https://play.golang.org/p/xXaA9skdplz


    original := `foo=bar baz foo:1  foo:234.mds32  notfoo:baz  foo:bak foo:nospace foo:bar`

    for _, item := range strings.Fields(original) {
        if kv := strings.SplitN(item, "=", 2); len(kv) == 2 {
            fmt.Printf("key/value: %q -> %q\n", kv[0], kv[1])
        } else if kv := strings.SplitN(item, ":", 2); len(kv) == 2 {
            fmt.Printf("key/value: %q -> %q\n", kv[0], kv[1])
        } else {
            fmt.Printf("key: %q\n", item)
        }

    }

Obviously you'll need to modify this code to collect the answers rather than print them.
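
As a rough sketch of that collection step, assuming only tokens whose key is exactly foo and whose value is non-empty are wanted (collectFoo is a hypothetical helper name, not part of the code above):

```go
package main

import (
	"fmt"
	"strings"
)

// collectFoo (hypothetical helper) returns every whitespace-separated token
// of the form foo=<value> or foo:<value> with a non-empty value.
func collectFoo(input string) []string {
	var out []string
	for _, item := range strings.Fields(input) {
		// Try "=" first, then ":", mirroring the order used above.
		for _, sep := range []string{"=", ":"} {
			kv := strings.SplitN(item, sep, 2)
			if len(kv) == 2 && kv[0] == "foo" && kv[1] != "" {
				out = append(out, item)
				break
			}
		}
	}
	return out
}

func main() {
	original := `foo=bar baz foo:1  foo:234.mds32  notfoo:baz  foo:bak foo:nospace foo:bar`
	for _, item := range collectFoo(original) {
		fmt.Printf("%q\n", item)
	}
}
```

Because strings.Fields already drops the surrounding whitespace, each collected token comes back without leading or trailing spaces.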

If you have to use regexes, then please use the other answers.

Upvotes: 1

hobbs

Reputation: 239930

There are several approaches you could take:

  1. Just change your pattern to (?:\s|^)(?P<primary>foo[:=]\S+) as Wiktor Stribiżew mentions in a comment, instead of matching .+? up to \s. This solves the problem with no shenanigans, but I will list a few more options that might be applicable to similar problems that couldn't be so easily negated.

  2. Since the problem is with the FindAll functions not allowing the overlap, don't use them! Instead, roll your own, using FindStringSubmatchIndex to get the boundaries of one match, extract the matched text by slicing the string, then do d = d[endIndex-1:] and loop until FindStringSubmatchIndex returns nil.

  3. Use regexp.Split() with a pattern of \s+ to break the input string into whitespace-separated components, then just discard the ones that don't regexp.Match() on ^foo[:=]. You could even use strings.HasPrefix(s, "foo:") || strings.HasPrefix(s, "foo=") on each component s instead. The remaining ones will be your desired matches, and the whitespace around them will have already been discarded by the split. In my opinion this version conveys intent more clearly than trying to use a match.
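
The three options above could be sketched roughly as follows. The function and variable names are made up for this illustration. Two details are assumptions rather than part of the list: option 2 uses a trailing (?:\s|$) so the last token in the string can match, and it resumes at the end of the captured group, which is the same position as endIndex-1 whenever a whitespace character closed the match.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// Option 1: \S+ stops the value at whitespace without consuming it.
	nonSpaceRe = regexp.MustCompile(`(?:\s|^)(?P<primary>foo[:=]\S+)`)
	// Option 2: delimiter-consuming pattern; the |$ is an assumed tweak.
	loopRe = regexp.MustCompile(`(?:\s|^)(?P<primary>foo[:=].+?)(?:\s|$)`)
	// Option 3: plain whitespace splitter.
	spaceRe = regexp.MustCompile(`\s+`)
)

func byNonSpace(d string) []string {
	idx := nonSpaceRe.SubexpIndex("primary")
	var out []string
	for _, m := range nonSpaceRe.FindAllStringSubmatch(d, -1) {
		out = append(out, m[idx])
	}
	return out
}

func byManualLoop(d string) []string {
	idx := loopRe.SubexpIndex("primary")
	var out []string
	for {
		loc := loopRe.FindStringSubmatchIndex(d)
		if loc == nil {
			return out
		}
		start, end := loc[2*idx], loc[2*idx+1]
		out = append(out, d[start:end])
		// Resume at the end of the captured group, so the whitespace that
		// closed this match is still available to open the next one.
		d = d[end:]
	}
}

func bySplit(d string) []string {
	var out []string
	for _, s := range spaceRe.Split(d, -1) {
		// Note: unlike the regexes, this also accepts an empty value ("foo:").
		if strings.HasPrefix(s, "foo:") || strings.HasPrefix(s, "foo=") {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	d := `foo=bar baz foo:1  foo:234.mds32  notfoo:baz  foo:bak foo:nospace foo:bar`
	fmt.Printf("%q\n", byNonSpace(d))
	fmt.Printf("%q\n", byManualLoop(d))
	fmt.Printf("%q\n", bySplit(d))
}
```

On the sample input all three should yield the same six foo tokens, each ending at the first whitespace.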

Upvotes: 2

Wiktor Stribiżew

Reputation: 626870

Unfortunately, there is no lookaround support in Go regexp. You can work around this limitation by doubling every whitespace character (e.g. with regexp.MustCompile(`\s`).ReplaceAllString(d, "$0$0")), so that one copy can end a match and the other can start the next one, and then matching with (?:\s|^)(?P<primary>foo[:=]\S+(?:\s+[^:\s]+)*)(?:\s|$):

    package main

    import (
        "fmt"
        "regexp"
    )

    func main() {
        var d = `foo=bar baz foo:1  foo:234.mds32  notfoo:baz  foo:bak foo:nospace foo:bar`
        d = regexp.MustCompile(`\s`).ReplaceAllString(d, "$0$0")
        r := regexp.MustCompile(`(?:\s|^)(?P<primary>foo[:=]\S+(?:\s+[^:\s]+)*)(?:\s|$)`)
        idx := r.SubexpIndex("primary")
        for _, m := range r.FindAllStringSubmatch(d, -1) {
            fmt.Printf("%q\n", m[idx])
        }
    }

See the Go demo. Output:

    "foo=bar  baz"
    "foo:1"
    "foo:234.mds32"
    "foo:bak"
    "foo:nospace"
    "foo:bar"

Details:

  • (?:\s|^) - a whitespace or start of string
  • (?P<primary>foo[:=]\S+(?:\s+[^:\s]+)*) - Group "primary": foo, a colon or = char, one or more non-whitespaces, and then zero or more occurrences of one or more whitespaces and then one or more chars other than a whitespace or colon
  • (?:\s|$) - a whitespace or end of string.

Upvotes: 2
