creedqq

Reputation: 322

Split input string in go by regex

How can I split the input strings below by regex in Go? I know how to split on a dot, but how can I avoid splitting inside quotes?

Examples of strings:

"a.b.c.d" -> ["a", "b", "c", "d"]
"a."b.c".d" -> ["a", "b.c", "d"]
"a.'b.c'.d" -> ["a", "b.c", "d"]

Upvotes: 0

Views: 2220

Answers (3)

Julio

Reputation: 5308

Here is another option with a somewhat less hacky regex. It uses the trash-bin trick: unwanted text is matched without being captured, so the real data ends up in the first and second capturing groups.

It works even with nested quotes like this: "a.'b.c'.d.e."f.g.h"", as long as the nesting is not two or more levels deep (as in "a.'b."c.d"'", which has quotes inside quotes inside quotes).

The regex is this: ^"|['"](\w+(?:\.\w+)*)['"]|(\w+)

And the code:

package main

import (
    "regexp"
    "fmt"
)

func main() {
    var re = regexp.MustCompile(`^"|['"](\w+(?:\.\w+)*)['"]|(\w+)`)
    var str = `"a.'b.c'.d.e."f.g.h""`

    result := re.FindAllStringSubmatch(str, -1)
    for _, m := range result {
        if m[1] != "" || m[2] != "" {
            fmt.Println(m[1] + m[2])
        }
    }
}

Input:

"a.'b.c'.d.e."f.g.h""

Output:

a
b.c
d
e
f.g.h
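
For reuse, the same regex can be wrapped in a function and run against the three inputs from the question. This is only a sketch: the name splitTokens is hypothetical, and because \w matches only ASCII letters, digits and underscore in Go, fields containing other characters (spaces, hyphens) would not come through intact.

```go
package main

import (
	"fmt"
	"regexp"
)

// tokenRe is the regex from this answer; groups 1 and 2 hold the data.
var tokenRe = regexp.MustCompile(`^"|['"](\w+(?:\.\w+)*)['"]|(\w+)`)

// splitTokens (hypothetical name) returns the non-empty captures in order.
func splitTokens(s string) []string {
	var out []string
	for _, m := range tokenRe.FindAllStringSubmatch(s, -1) {
		// At most one of the two groups is non-empty in any match.
		if tok := m[1] + m[2]; tok != "" {
			out = append(out, tok)
		}
	}
	return out
}

func main() {
	for _, in := range []string{`a.b.c.d`, `a."b.c".d`, `a.'b.c'.d`} {
		fmt.Printf("%q\n", splitTokens(in))
	}
}
```

For the second and third inputs this yields the three fields a, b.c and d.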

Upvotes: 1

Schwern

Reputation: 165606

Matching balanced delimiters is a complex problem for regular expressions, as evidenced by John's answer, unless you're using something like the Go pcre package.

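Instead, the Go CSV parser can be adapted. Configure it to use . as the separator, lazy quotes, and variable-length records. Note that encoding/csv only treats " as a quote character, so this handles the double-quoted form; single quotes are left in the fields as ordinary characters.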

package main

import (
    "encoding/csv"
    "fmt"
    "io"
    "log"
    "strings"
)

func main() {
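    // Raw string literal: the double quotes need no backslash escaping here.
    lines := `a.b.c.d
a."b.c".d
a.'b.c'.d
`

    // Name the reader r so it does not shadow the csv package.
    r := csv.NewReader(strings.NewReader(lines))
    r.Comma = '.'
    r.LazyQuotes = true
    r.FieldsPerRecord = -1
    for {
        record, err := r.Read()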
        if err == io.EOF {
            break
        }
        if err != nil {
            log.Fatal(err)
        }

        fmt.Printf("%#v\n", record)
    }
}
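
If both quote styles from the question need to be honoured, one alternative to the CSV reader is a small hand-written scanner. This is a different technique from the one above, sketched under assumptions: the helper name splitQuoted is hypothetical, quotes are dropped from the output, there is no escape syntax, and an unterminated quote is silently tolerated.

```go
package main

import (
	"fmt"
	"strings"
)

// splitQuoted (hypothetical helper) splits s on '.' except inside a pair of
// matching single or double quotes. The quote characters are dropped.
func splitQuoted(s string) []string {
	var fields []string
	var cur strings.Builder
	var quote rune // 0 outside quotes, otherwise the opening quote character
	for _, r := range s {
		switch {
		case quote != 0 && r == quote:
			quote = 0 // closing quote
		case quote == 0 && (r == '"' || r == '\''):
			quote = r // opening quote
		case quote == 0 && r == '.':
			fields = append(fields, cur.String())
			cur.Reset()
		default:
			cur.WriteRune(r) // ordinary character, or a '.' inside quotes
		}
	}
	// Flush the last field; an unterminated quote is accepted (assumption).
	return append(fields, cur.String())
}

func main() {
	for _, in := range []string{`a.b.c.d`, `a."b.c".d`, `a.'b.c'.d`} {
		fmt.Printf("%q\n", splitQuoted(in))
	}
}
```

Unlike the regex approaches, this keeps empty fields, so an input such as a..b gives three fields with an empty one in the middle.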

Upvotes: 1

John

Reputation: 2425

Since Go's regexp package doesn't support lookaheads, I don't think finding a regular expression that matches only the . you want to split on will be easy, or even possible. Instead, you can match the surrounding text and capture only the parts you want:

The regular expression itself is a bit ugly, but here's the breakdown (ignoring the extra escaping needed inside a Go string literal):

(\'[^.'"]+(?:\.[^.'"]+)+\')|(\"[^.'"]+(?:\.[^.'"]+)+\")|(?:([^.'"]+)\.?)|(?:\.([^.'\"]+))

There are four scenarios that this regular expression matches, and captures certain subsets of these matches:

  • (\'[^.'"]+(?:\.[^.'"]+)+\') - Match and capture single-quoted text
    • \' - Match ' literally
    • [^.'"]+ - Match sequence of non-quotes and non-periods
    • (?:\.[^.'"]+)+ - Match a period followed by a sequence of non-quotes and non-periods, repeated as many times as needed. Not captured.
    • \' - Match ' literally
  • (\"[^.'"]+(?:\.[^.'"]+)+\") - Match and capture double-quoted text
    • Same as above but with double quotes
  • (?:([^.'"]+)\.?) - Match text followed by an optional ., not capturing the .
    • ([^.'"]+) - Match and capture sequence of non-quotes and non-periods
    • \.? - Optionally match a period (optional to capture the last bit of delimited text)
  • (?:\.([^.'"]+)) - Match text preceded by a ., not capturing the .
    • Same as above but with the period coming before the capture group, and also non-optional

Example code that dumps the captures:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    re := regexp.MustCompile("('[^.'\"]+(?:\\.[^.'\"]+)+')|(\"[^.'\"]+(?:\\.[^.'\"]+)+\")|(?:([^.'\"]+)\\.?)|(?:\\.([^.'\"]+))")
    txt := "a.b.c.'d.e'"

    result := re.FindAllStringSubmatch(txt, -1)

    for k, v := range result {
        fmt.Printf("%d. %s\n", k, v)
    }
}
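
To turn those captures into a plain list of fields, one possible sketch is to take the single non-empty group from each match and trim the quotes that groups 1 and 2 include. The function name fields is hypothetical, and the regex is the same one written as a raw string so the backslashes need no doubling.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Same pattern as above, written as a raw string literal.
var re = regexp.MustCompile(`('[^.'"]+(?:\.[^.'"]+)+')|("[^.'"]+(?:\.[^.'"]+)+")|(?:([^.'"]+)\.?)|(?:\.([^.'"]+))`)

// fields (hypothetical name) picks the one non-empty capture from each match
// and strips the surrounding quotes kept by groups 1 and 2.
func fields(s string) []string {
	var out []string
	for _, m := range re.FindAllStringSubmatch(s, -1) {
		for _, g := range m[1:] {
			if g != "" {
				out = append(out, strings.Trim(g, `'"`))
				break
			}
		}
	}
	return out
}

func main() {
	for _, in := range []string{`a.b.c.'d.e'`, `a."b.c".d`, `a.b.c.d`} {
		fmt.Printf("%q\n", fields(in))
	}
}
```

For the input a.b.c.'d.e' this gives the four fields a, b, c and d.e.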

Upvotes: 1
