Reputation: 939
I'm working on a side project (trying to learn regex and get better at parsing in general) and trying to write a function that will validate if a string is valid under a particular grammar. The grammar is as follows:
statement -> delimeter token
delimeter -> / or -
token -> name ([check])* (delimeter token)?
check -> token
@id="..."
I've written out regular expressions for each of the above (except token), they are written below. However, when I tried to write out the token regex, I realized that it depends on itself (recursive). I'm not too sure how to write this regex or if this is even the correct way to go about it anymore, since check can go very deep potentially. Is there a better way to validate if the string can be represented by the grammar or no? If not, how do I do this with regex?
String delimeter = "/|-";
String name = "((?i)\\A[a-z][_a-z\\d\\-\\.]*){1}";
String checkToken = would just be equal to token;
String checkID = "(?i)\\A\\s*@id\\s*=\\s*\".*\"\\s*\\Z";
I'm using the String.matches call to see if a string matches the regex, right now just checking smaller things, like if a name is correct.
Upvotes: 2
Views: 2251
Reputation:
You are looking for a better understanding of the Chomsky hierarchy.
The simple form of the hierarchy has the following types:
The regular expression is a depiction of the finite state automaton which can match regular languages. If the language isn't regular, you risk summoning Tony the Pony when trying to match a non-regular language with a regular expression (this isn't a good thing).
A given tool for matching can match any language on its level or higher. So a non-deterministic pushdown automaton can match a context-free language and a regular language. But a finite state automata can only match a regular language.
Typically, in compiler design and the like, the lexer (working off of regular languages) is paired with a parser generator which works with context free languages. This can be seen with the paring of lex and yacc, or flex and bison.
Lex has a syntax that matches tokens and passes them on to yacc. In the modern Java world, you may wish instead to look at antlr - Another Tool for Language Recognition to help you with writing the parser. JavaCC also comes recommended (another tool that some like better, you should look into both of these if you intend to go down this path). Lex & Yacc, Antlr, and JavaCC are part of a domain of tools known as parser generators if you want a comparison of them.
I'd suggest giving Lex & Yacc Tutorial a read. While, yes, it is for lex and yacc which you aren't using, there is a section on the theory behind it (lexing and parsing). Understanding the theory will help you understand why your current approach isn't working.
Upvotes: 4
Reputation: 7380
Grammar with recursive definitions are generally not regular and can therefore not be parsed with regular expressions.
In your case however, it seems that you can convert the grammar to a regular form:
statement -> ( delimiter token )+
delimiter -> / or -
token -> name ([check])*
check -> token
@id="..."
Upvotes: 2
Reputation: 311052
trying to write a function that will validate if a string is valid under a particular grammar
Err, the parser is the function that does that. If it parses, it's valid. If it gets a syntax error, it isn't. And this is validating a string, not validating the grammar itself as per your title.
I've written out regular expressions for each of the above (except token), they are written below. However, when I tried to write out the token regex, I realized that it depends on itself (recursive). I'm not too sure how to write this regex or if this is even the correct way to go about it anymore, since check can go very deep potentially. Is there a better way to validate if the string can be represented by the grammar or no? If not, how do I do this with regex?
You don't.
You can't parse recursive grammars with regular expressions. Regular expressions are used to characterize the lexical analyser. The grammar will be a context-free grammar, either LL(1) or LR(1). If you don't know what these terms mean you have a lot of reading to do.
Upvotes: 1