Reputation: 93
How to avoid ambiguity when directly constructing DFA from regular expression like [ab]*ab
?
In a straightforward implementation a cycle transition for [ab]*
eats all a or b`s, and it of course differs from existing implementations for regular expressions, they take into account "ab" at the end as well. So the implementing automata must somehow know when to jump to last two transitions "a to b" from the first transition cycle.
Any ideas how to achieve this?
I am interested in answers for direct DFA constructing, not NFA with transformation to DFA. Adrian McCarthy's answer probably took the point:
"The simplistic answer is that the automaton simply has a state that literally means it doesn't know yet which path was taken."
- but without detailed explanation.
Upvotes: 1
Views: 1237
Reputation: 536
Using http://hackingoff.com/compilers/regular-expression-to-nfa-dfa with the regexp string:
(a|b)*ab
The DFA generated from NFA has three states:
S0: initial state
S1: after an 'a'
S2: after a 'b' that doesn't follow an a (follows a 'b' or is initial 'b')
S3: after a 'b' in S1
S3: accept
Existing implementations for regular expressions often allow backreferences (not so RE2), and use backtracking rather than NFA -> DFA, and are not therefore DFA (the backtracking logic adds state, similar to recursive call stack adding state). If you don't want to use backtracking, one way to parse expressions like this may be going backwards from accept state:
ab
S0 -a-> S1 -b-> S2 (accept)
[ab]*ab
S0 -a-> S1 -b-> S2 (accept)
Sn -a-> S1
S0,S2 -b-> S3
The NFA reduction allows for seeing that the case of 'a' is different (as it is partial match of existing final sequence), we should go to state S1 instead. The same link above gives larger than necessary DFA diagram for a longer string (their states 2 and 3 should be combined to a single state, any 'b' or 'c' stays in state, 'a' goes out to state 1):
(a|b|c)*abc
To create DFA yourself, you can first generate the original DFA for abc:
S0 -a-> S1 -b-> S2 -c-> S3 (accept)
and then fill in whatever doesn't match this track (top one highest priority, overrides whatever is next):
S0 -a-> S1 -b-> S2 -c-> S3 (accept)
Sn -a-> S1 (get on our track of final string / ambiguity state)
Sn -b,c-> S4 (b or c; get off our track)
This fills in to:
S0 -a-> S1 -b-> S2 -c-> S3 (accept)
S1,S2,S3,S4 -a-> S1
S0,S3,S4 -b,c-> S4
S1 -c-> S4
S2 -b-> S4
If we instead had:
(a|c)*aac
You should think of the accept track similar to abc:
S0 -a-> S1 -a-> S2 -c-> S3 (accept)
But also notice the sequence 'aa'. The ambiguity is relatively simple, still, as seeing any of these should keep you in state S2:
aa aaa aaaa aaaaa...a
The ambiguities are:
We can hierarchically fill in states from the track, but the ambiguity overlaps to S2 this time as well:
S0 -a-> S1 -a-> S2 -c-> S3 (accept)
S2 -a-> S2 (additional here, the second ambiguity)
Sn -a-> S1 (get on our track of final string / ambiguity state)
Sn -c-> S4 (get off our track)
This fills in to:
S0 -a-> S1 -a-> S2 -c-> S3 (accept)
S2 -a-> S2
S3,S4 -a-> S1
S0,S1,S3,S4 -c-> S4
My general observation (as someone who taught compiler design):
Beyond simple cases, to use backreferences to groupings and keeping that state as well, you are not talking about simple DFA anymore. It can still be converted to DFA, but if you don't backtrack (not a part of your standard DFA, backtracking logic adds additional state), your state space will be exponentially large, and there is no general simple approach to directly convert to DFA, you should really construct NFA first.
See How do backreferences in regexes make backtracking required?
Upvotes: 3