Reputation: 9372
I want to speed up my program written in Go and convert regular expressions to finite state machines with ragel
. I cannot figure out how to match end of input correctly when converting regular expressions similar to /cat[s]?(\b|$)/
(it matches a word border or end of input), so I made this workaround:
package main
import(
"strings"
"fmt"
"unicode"
)
func Match(data []byte) bool {
data = []byte(strings.ToLower(string(data)))
%%{
machine test;
write data;
}%%
cs, p, pe, eof := 0, 0, len(data), len(data)
_ = eof
var matchPos int
%%{
main := ('cat' 's'?) @{matchPos = p};
write init;
write exec;
}%%
return checkMatch(data, matchPos+1)
}
func checkMatch(data []byte, p int) bool {
if p == len(data) {
return true
}
tail := string(data[p:])
c := []rune(tail)[0]
if !unicode.IsLetter(c) && !unicode.IsDigit(c) {
return true
}
return false
}
func main () {
vs := []string{
"cat",
"cats",
"cat and dog",
"cats and dogs",
"caterpillar",
}
for _, v := range vs {
fmt.Printf("'%s': %v\n", v, Match([]byte(v)))
}
}
the output is correct:
'cat': true
'cats': true
'cat and dog': true
'cats and dogs': true
'caterpillar': false
I do think there is a better way though. What is a "proper" way to handle end of input with ragel
?
Upvotes: 0
Views: 418
Reputation: 4977
The proper way to handle end of input is to use EOF actions, of course. And use actions in general, like this (reduced Match
function):
var matched bool
%%{
action setMatched {
matched = true
}
main := ('cat' 's'?) %/setMatched ([ \t] >setMatched);
write init;
write exec;
}%%
// Variable generated and not used by Ragel.
_ = _test_trans_actions
return matched
Which produces the following output (notice one important test case added):
'cat': true
'cats': true
'cat and dog': true
'cats and dogs': true
'catspaw': false
'caterpillar': false
What it adds is setMatched
action that is triggered either by the EOF in one of the final states (%/setMatched
) of the first machine (cats?
), or upon entering (>setMatched
) the second one (which is almost \b
, but can actually be replaced with the internal space
machine). And it eliminates the need for checkMatch
completely.
Upvotes: 1