Viktor Rozenko
Viktor Rozenko

Reputation: 91

How do I check if a string is a valid prefix of a regex?

So I'm using Rust to build a parser for my programming language and I need it to well ... parse. I'm implementing a Stream struct that is responsible for some basic operations on Vec<char> that is possesses. One of those functions is as follows:

fn consume_while_regex_matches(&self, regex: Regex, mut acc: String) -> (Self, String)

It's a recursive function that takes a regex and an accumulator which it's supposed to be filling up until the next char in the stream makes it no longer match the Regex. I'd use this function to match simple tokens like integers, strings, etc.

I tried implementing it in the following kind of way (this is not the exact version, but it explains the issue):

acc.push(self.peek())
if !regex.is_match() {
    acc.pop();
    return self.clone(), acc;
}
return self.consume_while_regex_matches(regex, acc);

The problem with this code is that it is testing whether acc matches the whole regex. Imagine if we want to consume a string -42 with a regex like ^-[0-9]+$. The algorithm would read the very first char -, the match would fail and the accumulator is going to be empty.

Is there a way to check that a string (e.g. acc) is a prefix of a potential regex match? Like - is not a match on its own, but -42 is a match and - is a valid prefix. And it'd be great if it's like a library way and it doesn't require me to produce my own regex engine.

Update: I'm not using the described function for parsing. I use it for lexing. I am aware that regex is not enough to parse a complex language.

What I'm asking is whether I can use some regex lib to match tokens gradually as opposed to having to provide the whole string to match against. I'm looking for a way to check whether the underlying DFA is or isn't in the error state by the end of marching a string without having to write my own regex parser and DFA implementation. If this was possible, I'd pass the - to the integer regex, check that after marching it didn't end up in the ERROR state, and if so, it's a valid prefix.

Upvotes: 1

Views: 1992

Answers (2)

kmdreko
kmdreko

Reputation: 60493

Taking the question at face value. The crate regex-automata (maintained by one of the regex authors) provides some access to the lower level details of building and parsing regexes. In particular, you can access and drive the internal DFA (deterministic finite automata) yourself.

use regex_automata::{Regex, DFA}; // 0.1.9

#[derive(Copy, Clone, Debug, Eq, PartialEq)]
enum PotentialMatch {
    Match,
    CouldMatch,
    DoesntMatch,
}

fn potentially_matches(pattern: &Regex, input: &str) -> PotentialMatch {
    let input = input.as_bytes();
    let dfa = pattern.forward();
    let mut state = dfa.start_state();
    for byte in input {
        state = dfa.next_state(state, *byte);
        if dfa.is_dead_state(state) {
            return PotentialMatch::DoesntMatch;
        }
    }
    
    if dfa.is_match_state(state) {
        PotentialMatch::Match
    }
    else {
        PotentialMatch::CouldMatch
    }
}

fn main() {
    let pattern = Regex::new("-[0-9]+").unwrap();
    
    assert_eq!(potentially_matches(&pattern, ""), PotentialMatch::CouldMatch);
    assert_eq!(potentially_matches(&pattern, "-"), PotentialMatch::CouldMatch);
    assert_eq!(potentially_matches(&pattern, "-1"), PotentialMatch::Match);
    assert_eq!(potentially_matches(&pattern, "-12"), PotentialMatch::Match);
    assert_eq!(potentially_matches(&pattern, "-12a"), PotentialMatch::DoesntMatch);
}

You could probably integrate this state tracking into your implementation to be more performant over calling potentially_matches() over and over.

Upvotes: 1

CinchBlue
CinchBlue

Reputation: 6190

Regular expressions are basically a succinct format for state machines that work on inputs of characters/strings, whether finite-state machines or pushdown automata, depending on the complexity of the language you need to accept/reject.

Your approach is to build the string until it matches against a complex regex, but regexs are essentially often recompiled into state machines for efficiency. The way these state machines work is to break down the complex regex into individual states, and then, against these states, parse and transition between states token-by-token (or character-by-character). A final regex may have many different states. Here's a (somewhat improper) FSM example for [+-][1-9][0-9]* with s2 as the accept state:

a finite state machine with 3 states representing the regex above

Is there a way to check that a string (e.g. acc) is a prefix of a potential regex match? Like - is not a match on its own, but -42 is a match and - is a valid prefix. And it'd be great if it's like a library way and it doesn't require me to produce my own regex engine.

You could use pure regex for simpler parsing problems, but if you're building a programming language, you're eventually going to be looking for a parsing library to handle parsing to arbitrary nested depth (which requires a system stronger than regexes). In Rust, one of the most popular crates for this is nom. Nom is a parsing combinator-style library, which means it relies on putting smaller parsers together in different ways with "operators" called "combinators." Regular expressions are often a weaker form of parser than these due to the limited number and limited properties of their operators/combinators.

I will also note that the full job of validating a proper program written in a programming language won't be satisfied by either, which is why we bother going beyond the lexing/parsing process. All traditional and proper programming languages are somewhere between context-sensitive grammars and Turing-complete languages, which means parsing alone won't satisfy your needs because there is additional context and semantic information that needs to be checked by type checking, reference checking, etc.

Upvotes: 0

Related Questions