Can one find out which input characters matched which part of a regex?

Question

I'm trying to build a tool that uses something like regexes to find patterns in a string (not a text string, but that is not important right now). I'm familiar with automata theory, i.e. I know how to implement basic regex matching, and output true or false if the string matches my regex, by simulating an automaton in the textbook way.

Say I'm interested in all as that comes before bs, with no more as before the bs, so, this regex: a[^a]*b. But I don't just want to find out if my string contains such a part, I want to get as output the a, so that I can inspect it (remember, I'm not actually dealing with text).

In summary: Let's say I mark the a with parentheses, like so: (a)[^a]*b and run it on the input string bcadacb then I want the second a as output.

Or, more generally, can one find out which characters in the input string matches which part of the regex? How is it done in text editors? They at least know where the match started, because they can highlight the matches. Do I have to use a backtracking approach, or is there a smarter, less computationally expensive, way?

EDIT: Proper back references, i.e. capturing with parens and referencing with \1, etc. may not be necessary. I do know that back references do introduce the need for backtracking (or something similar) and make the problem (IIRC) NP-hard. My question, in essence, is: Is the capturing part, without the back referencing, less computationally expensive than proper back references?

Russ Cox · Accepted Answer

Most text editors do this by using a backtracking algorithm, in which case recording the match locations is trivial to add.

It is possible to do with a direct NFA simulation too, by augmenting the state lists with parenthesis location information. This can be done in a way that preserves the linear time guarantee. See http://swtch.com/~rsc/regexp/regexp2.html#submatch.

Timos's answer is on the right track, but you cannot tag DFA states, because a DFA state corresponds to a collection of possible NFA states, and so one DFA state might represent the possibility of having passed a paren (but maybe something else too) and if that turns out not to be the case, it would be incorrect to record it as fact. You really need to work on the NFA simulation instead.

Can one find out which input characters matched which part of a regex?

Answers (2)

Related Questions