mbudge

Reputation: 557

Go regex wildcard to get a tag without surrounding text

I'm trying to extract the value "done" from the following header, which is in a byte slice returned at the end of a chunked HTTP stream.

X-sync-status: done\r\n

This is the Go regex I have so far:

syncStatusRegex = regexp.MustCompile("(?i)X-sync-status:(.*)\r\n")

I just want it to return this bit

(.*)

This is the code to get the status

syncStatus := strings.TrimSpace(string(syncStatusRegex.Find(body)))
fmt.Println(syncStatus)

How do I get it to just return "done" and not the header?

Thanks

Upvotes: 0

Views: 377

Answers (1)

Markus W Mahlberg

Reputation: 20703

What you want is access to the capturing group rather than the whole match. I prefer named capturing groups, and a very simple helper function deals with them:

package main

import (
    "fmt"
    "regexp"
)

// Our example input
const input = "X-sync-status: done\r\n"

// We anchor the regex with "^". Without the "m" flag this is the beginning
// of the input; use (?im) if the header can follow other lines.
// Then we have a fixed string until our capturing group begins.
// Within our capturing group, we want all consecutive word characters
// (\w matches letters, digits and the underscore).
const regexString = `(?i)^X-sync-status: (?P<status>\w*)`

// MustCompile panics at startup if the pattern is invalid,
// so a regexp that compiles here is safe to use.
var syncStatusRegexp = regexp.MustCompile(regexString)


// The helper function...
func namedResults(re *regexp.Regexp, in string) map[string]string {

    result := make(map[string]string)

    // ... does the matching; FindStringSubmatch returns nil if there is
    // no match, in which case we return the empty map instead of panicking
    match := re.FindStringSubmatch(in)
    if match == nil {
        return result
    }

    // and puts the value for each named capturing group
    // into the result map
    for i, name := range re.SubexpNames() {
        if i != 0 && name != "" {
            result[name] = match[i]
        }
    }
    return result
}

func main() {
    fmt.Println(namedResults(syncStatusRegexp, input)["status"])
}

Note: your current regexp is somewhat faulty, since the capturing group also captures the whitespace after the colon. With your current regexp, the group would contain " done" instead of "done".

Edit: Of course, you can do this much cheaper without regexp:

fmt.Print(strings.Trim(strings.Split(input, ":")[1], " \r\n"))

Edit 2: I was curious how much cheaper the split method is, so I came up with this very crude comparison:

package main

import (
    "fmt"
    "log"
    "regexp"
    "strings"
)

// Our example input
const input = "X-sync-status: done\r\n"

// We anchor the regex with "^". Without the "m" flag this is the beginning
// of the input; use (?im) if the header can follow other lines.
// Then we have a fixed string until our capturing group begins.
// Within our capturing group, we want all consecutive word characters
// (\w matches letters, digits and the underscore).
const regexString = `(?i)^X-sync-status: (?P<status>\w*)`

// MustCompile panics at startup if the pattern is invalid,
// so a regexp that compiles here is safe to use.
var syncStatusRegexp = regexp.MustCompile(regexString)

func statusBySplit(in string) string {
    // Use the parameter, not the package-level constant.
    return strings.Trim(strings.Split(in, ":")[1], " \r\n")
}

func statusByRegexp(re *regexp.Regexp, in string) string {
    return re.FindStringSubmatch(in)[1]
}

[...]

and a little benchmark:

package main

import "testing"

func BenchmarkRegexp(b *testing.B) {
    for i := 0; i < b.N; i++ {
        statusByRegexp(syncStatusRegexp, input)
    }
}

func BenchmarkSplit(b *testing.B) {
    for i := 0; i < b.N; i++ {
        statusBySplit(input)
    }
}

Then I ran each benchmark 5 times with 1, 2 and 4 CPUs available. The result is, in my opinion, pretty convincing:

go test -run=^$ -test.bench=.  -test.benchmem -test.cpu 1,2,4 -test.count=5
goos: darwin
goarch: amd64
pkg: github.com/mwmahlberg/so-regex
BenchmarkRegexp          5000000               383 ns/op              32 B/op          1 allocs/op
BenchmarkRegexp          5000000               382 ns/op              32 B/op          1 allocs/op
BenchmarkRegexp          5000000               382 ns/op              32 B/op          1 allocs/op
BenchmarkRegexp          5000000               382 ns/op              32 B/op          1 allocs/op
BenchmarkRegexp          5000000               384 ns/op              32 B/op          1 allocs/op
BenchmarkRegexp-2        5000000               384 ns/op              32 B/op          1 allocs/op
BenchmarkRegexp-2        5000000               382 ns/op              32 B/op          1 allocs/op
BenchmarkRegexp-2        5000000               384 ns/op              32 B/op          1 allocs/op
BenchmarkRegexp-2        5000000               382 ns/op              32 B/op          1 allocs/op
BenchmarkRegexp-2        5000000               382 ns/op              32 B/op          1 allocs/op
BenchmarkRegexp-4        5000000               382 ns/op              32 B/op          1 allocs/op
BenchmarkRegexp-4        5000000               382 ns/op              32 B/op          1 allocs/op
BenchmarkRegexp-4        5000000               380 ns/op              32 B/op          1 allocs/op
BenchmarkRegexp-4        5000000               380 ns/op              32 B/op          1 allocs/op
BenchmarkRegexp-4        5000000               377 ns/op              32 B/op          1 allocs/op
BenchmarkSplit          10000000               161 ns/op              80 B/op          3 allocs/op
BenchmarkSplit          10000000               161 ns/op              80 B/op          3 allocs/op
BenchmarkSplit          10000000               164 ns/op              80 B/op          3 allocs/op
BenchmarkSplit          10000000               165 ns/op              80 B/op          3 allocs/op
BenchmarkSplit          10000000               162 ns/op              80 B/op          3 allocs/op
BenchmarkSplit-2        10000000               159 ns/op              80 B/op          3 allocs/op
BenchmarkSplit-2        10000000               167 ns/op              80 B/op          3 allocs/op
BenchmarkSplit-2        10000000               161 ns/op              80 B/op          3 allocs/op
BenchmarkSplit-2        10000000               159 ns/op              80 B/op          3 allocs/op
BenchmarkSplit-2        10000000               159 ns/op              80 B/op          3 allocs/op
BenchmarkSplit-4        10000000               159 ns/op              80 B/op          3 allocs/op
BenchmarkSplit-4        10000000               161 ns/op              80 B/op          3 allocs/op
BenchmarkSplit-4        10000000               159 ns/op              80 B/op          3 allocs/op
BenchmarkSplit-4        10000000               160 ns/op              80 B/op          3 allocs/op
BenchmarkSplit-4        10000000               160 ns/op              80 B/op          3 allocs/op
PASS
ok      github.com/mwmahlberg/so-regex  61.340s

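The numbers show that for extracting a single header value, the split is more than twice as fast as a precompiled regexp (roughly 160 ns/op versus roughly 380 ns/op), although it allocates more (80 B and 3 allocations per operation versus 32 B and 1). For your use case, I would go for the split.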

Upvotes: 1
