oskarkv
oskarkv

Reputation: 2379

Can one find out which input characters matched which part of a regex?

I'm trying to build a tool that uses something like regexes to find patterns in a string (not a text string, but that is not important right now). I'm familiar with automata theory, i.e. I know how to implement basic regex matching, and output true or false if the string matches my regex, by simulating an automaton in the textbook way.

Say I'm interested in all as that comes before bs, with no more as before the bs, so, this regex: a[^a]*b. But I don't just want to find out if my string contains such a part, I want to get as output the a, so that I can inspect it (remember, I'm not actually dealing with text).

In summary: Let's say I mark the a with parentheses, like so: (a)[^a]*b and run it on the input string bcadacb then I want the second a as output.

Or, more generally, can one find out which characters in the input string matches which part of the regex? How is it done in text editors? They at least know where the match started, because they can highlight the matches. Do I have to use a backtracking approach, or is there a smarter, less computationally expensive, way?

EDIT: Proper back references, i.e. capturing with parens and referencing with \1, etc. may not be necessary. I do know that back references do introduce the need for backtracking (or something similar) and make the problem (IIRC) NP-hard. My question, in essence, is: Is the capturing part, without the back referencing, less computationally expensive than proper back references?

Upvotes: 2

Views: 544

Answers (2)

Russ Cox
Russ Cox

Reputation: 1931

Most text editors do this by using a backtracking algorithm, in which case recording the match locations is trivial to add.

It is possible to do with a direct NFA simulation too, by augmenting the state lists with parenthesis location information. This can be done in a way that preserves the linear time guarantee. See http://swtch.com/~rsc/regexp/regexp2.html#submatch.

Timos's answer is on the right track, but you cannot tag DFA states, because a DFA state corresponds to a collection of possible NFA states, and so one DFA state might represent the possibility of having passed a paren (but maybe something else too) and if that turns out not to be the case, it would be incorrect to record it as fact. You really need to work on the NFA simulation instead.

Upvotes: 4

timos
timos

Reputation: 2707

After you constructed your DFA for the matching, mark all states which correspond to the first state after an opening parenthesis in the regex. When you visit such a state, save the index of the current input character, when you visit a state which corresponds to a closing parenthesis, also save the index. When you reach an accepting state, output the two indices. I am not sure if this is the algorithm used in text editors, but that's how I would do it.

Upvotes: 1

Related Questions