Reputation: 322
How can I split the input strings below with a regex in Go? I know how to split on a dot, but how can I avoid splitting inside quotes?
Examples of strings:
"a.b.c.d" -> ["a", "b", "c", "d"]
"a."b.c".d" -> ["a", "b.c", "d"]
"a.'b.c'.d" -> ["a", "b.c", "d"]
Upvotes: 0
Views: 2220
Reputation: 5308
Here is another option with a somewhat less hacky regex. It uses the trash-bin trick: unwanted matches fall into an alternative with no capturing group, so the real data ends up in the first and second capturing groups.
It works even with nested quotes like this: "a.'b.c'.d.e."f.g.h"", as long as the nesting is not 2 or more levels deep (as in "a.'b."c.d"'", quotes inside quotes inside quotes).
The regex is this: ^"|['"](\w+(?:\.\w+)*)['"]|(\w+)
And the code:
package main

import (
	"fmt"
	"regexp"
)

func main() {
	re := regexp.MustCompile(`^"|['"](\w+(?:\.\w+)*)['"]|(\w+)`)
	str := `"a.'b.c'.d.e."f.g.h""`
	result := re.FindAllStringSubmatch(str, -1)
	for _, m := range result {
		// Skip "trash bin" matches, where neither group captured anything.
		if m[1] != "" || m[2] != "" {
			fmt.Println(m[1] + m[2])
		}
	}
}
Input:
"a.'b.c'.d.e."f.g.h""
Output:
a
b.c
d
e
f.g.h
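The same regex can be wrapped in a small function that returns a slice, which is closer to what the question asks for. This is only a sketch: splitQuoted and tokenRe are hypothetical names, and because the pattern relies on \w it assumes every token consists of letters, digits and underscores.

```go
package main

import (
	"fmt"
	"regexp"
)

// tokenRe is the regex from the answer above. The first alternative is the
// "trash bin" (no capturing group); the other two capture the real data.
var tokenRe = regexp.MustCompile(`^"|['"](\w+(?:\.\w+)*)['"]|(\w+)`)

// splitQuoted is a hypothetical helper, not part of any library.
func splitQuoted(s string) []string {
	var parts []string
	for _, m := range tokenRe.FindAllStringSubmatch(s, -1) {
		// At most one of the two groups is non-empty for a given match.
		if tok := m[1] + m[2]; tok != "" {
			parts = append(parts, tok)
		}
	}
	return parts
}

func main() {
	for _, in := range []string{`a.b.c.d`, `a."b.c".d`, `a.'b.c'.d`} {
		fmt.Printf("%q\n", splitQuoted(in))
	}
}
```

For the three strings from the question this prints ["a" "b" "c" "d"], then ["a" "b.c" "d"] twice.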
Upvotes: 1
Reputation: 165606
Matching balanced delimiters is a complex problem for regular expressions, as evidenced by John's answer, unless you're using something like a Go PCRE package.
Instead, the Go CSV parser can be adapted. Configure it to use . as the separator, lazy quotes, and variable-length records. Note that encoding/csv always uses " as its quote character and this cannot be changed, so the double-quoted example is kept together while 'b.c' is still split on the dot.
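package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"log"
	"strings"
)

func main() {
	// Raw string literal: the double quotes need no backslash escaping here.
	lines := `a.b.c.d
a."b.c".d
a.'b.c'.d
`
	// Named r so the variable does not shadow the csv package.
	r := csv.NewReader(strings.NewReader(lines))
	r.Comma = '.'          // split on dots instead of commas
	r.LazyQuotes = true    // tolerate stray quotes inside fields
	r.FieldsPerRecord = -1 // allow a different number of fields per line

	for {
		record, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%#v\n", record)
	}
}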
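Because encoding/csv only recognizes " as a quote character, one possible way to cover the single-quoted example as well is to rewrite ' to " before parsing. The sketch below does that; splitDotted is a hypothetical helper name, and it assumes the input never contains a literal quote character that has to be preserved.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// splitDotted is a hypothetical helper: it rewrites ' to " so that
// encoding/csv treats both quote styles as its single quote character.
// Assumption: the input contains no literal quote that must be kept.
func splitDotted(line string) ([]string, error) {
	r := csv.NewReader(strings.NewReader(strings.ReplaceAll(line, "'", `"`)))
	r.Comma = '.'
	r.LazyQuotes = true
	return r.Read() // one record per call; the input is a single line
}

func main() {
	for _, in := range []string{`a.b.c.d`, `a."b.c".d`, `a.'b.c'.d`} {
		fields, err := splitDotted(in)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%q\n", fields)
	}
}
```

A fresh reader is created per line here for simplicity; for large inputs it would probably be better to normalize the whole input once and reuse a single reader, as in the program above.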
Upvotes: 1
Reputation: 2425
Since Go's regexp package doesn't support lookaheads (negative or otherwise), I don't think finding a regular expression that matches only the . you want to split on will be easy, or even possible. Instead, you can match the surrounding text and capture only the parts you need.
The regular expression itself is a bit ugly, but here's the breakdown (ignoring escaped characters for Go):
(\'[^.'"]+(?:\.[^.'"]+)+\')|(\"[^.'"]+(?:\.[^.'"]+)+\")|(?:([^.'"]+)\.?)|(?:\.([^.'\"]+))
There are four scenarios that this regular expression matches, capturing a subset of each match:
(\'[^.'"]+(?:\.[^.'"]+)+\') - Match and capture single-quoted text
    \' - Match ' literally
    [^.'"]+ - Match a sequence of non-quotes and non-periods
    (?:\.[^.'"]+)+ - Match a period followed by a sequence of non-quotes and non-periods, repeated as many times as needed. Not captured.
    \' - Match ' literally
(\"[^.'"]+(?:\.[^.'"]+)+\") - Match and capture double-quoted text
(?:([^.'"]+)\.?) - Match text followed by an optional ., not capturing the .
    ([^.'"]+) - Match and capture a sequence of non-quotes and non-periods
    \.? - Optionally match a period (optional so that the last bit of delimited text is still captured)
(?:\.([^.'"]+)) - Match text preceded by a ., not capturing the .
Example code that dumps the captures:
package main

import (
	"fmt"
	"regexp"
)

func main() {
	re := regexp.MustCompile("('[^.'\"]+(?:\\.[^.'\"]+)+')|(\"[^.'\"]+(?:\\.[^.'\"]+)+\")|(?:([^.'\"]+)\\.?)|(?:\\.([^.'\"]+))")
	txt := "a.b.c.'d.e'"
	result := re.FindAllStringSubmatch(txt, -1)
	// Each v holds the full match followed by the four capture groups.
	for k, v := range result {
		fmt.Printf("%d. %s\n", k, v)
	}
}
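The dump above still leaves each piece in one of four capture groups, with the quotes attached to the quoted ones. A minimal sketch of collapsing that into a flat slice might look like this; fields and fieldRe are hypothetical names, and the pattern is the same one written as a raw string so the backslashes don't need doubling.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// fieldRe is the same regex as above, written as a raw string literal.
var fieldRe = regexp.MustCompile(`('[^.'"]+(?:\.[^.'"]+)+')|("[^.'"]+(?:\.[^.'"]+)+")|(?:([^.'"]+)\.?)|(?:\.([^.'"]+))`)

// fields is a hypothetical helper: it keeps whichever capture group is
// non-empty in each match and strips the surrounding quotes, if any.
func fields(s string) []string {
	var out []string
	for _, m := range fieldRe.FindAllStringSubmatch(s, -1) {
		for _, g := range m[1:] {
			if g != "" {
				out = append(out, strings.Trim(g, `'"`))
			}
		}
	}
	return out
}

func main() {
	for _, in := range []string{`a.b.c.d`, `a."b.c".d`, `a.'b.c'.d`, `a.b.c.'d.e'`} {
		fmt.Printf("%q\n", fields(in))
	}
}
```

For the last input, the one used in the example above, this prints ["a" "b" "c" "d.e"].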
Upvotes: 1