johnnyodonnell
johnnyodonnell

Reputation: 2036

Why are Java regular expressions compiled?

I understand that Java regular expressions need to be compiled in order to do any type of regular expression pattern matching on strings, but I don't understand why they need to be compiled.

What the is the more efficient representation that a regex string is compiled to? And how is that representation more efficient than a String?

Upvotes: 4

Views: 915

Answers (2)

xtratic
xtratic

Reputation: 4699

What the is the more efficient representation that a regex string is compiled to? And how is that representation more efficient than a String?

Your question doesn't make much sense; I'm assuming you're asking about "compiled" vs "translated" RegEx which are basically the same thing anyway.

RegEx is essentially just a language for describing (to a computer) what you want to match.

Think of the RegEx string like code, the computer can't do anything with it until it's translated to something it can use first.

So "compiling" RegEx is just translating that string into machine instructions.

It is "more efficient than a String" because when you store the Pattern then you keep those instructions for matching to be re-used efficiently without having to re-translate the same RegEx string.

Eg. the below is (as I understand it) what you mean by compiled vs string representation:

public static void main(String[] args){
    // using "string representation"
    System.out.println("some string".matches("myRegex"));


    // using "compiled representation"
    Pattern myPattern = Pattern.compile("myRegex");

    Matcher myMatcher1 = myPattern.matcher("some string 1");
    Matcher myMatcher2 = myPattern.matcher("some string 2");

    System.out.println(myMatcher1.matches());
    System.out.println(myMatcher2.matches());
}

But if you look at the method used by "some string".matches("myRegex") then you'll see that it also calls Pattern.compile() to translate the RegEx instructions:

String.java

public boolean matches(String regex) {
    // return Pattern.matches(regex, this);
    Pattern p = Pattern.compile(regex);
    Matcher m = p.matcher(input);
    return m.matches();
}

So using the "string representation" still compiles the RegEx, it just doesn't cache the pattern for re-use.

Upvotes: 0

Jordan Kasper
Jordan Kasper

Reputation: 13273

In general, Regular Expression engines use a set of instructions to know how to walk through the target text and match portions of it. The high level (human-readable) pattern that we as developers write is like your source code in Java (or really any other language). The computer does not run your source code, it compiles that into instructions the computer can understand. Similarly, your RegEx pattern is compiled into a set of instructions that the RegEx engine (regardless of programming language) can process.

I personally find the Regular-Expressions.info site very helpful for lots of explanations, although their explanation of how the engine works internally is a bit light. This answer on SO is decent, with some other links.

If you want a more in depth answer, I would look at this page which talks about the nature of Regular Expression engines, which is that they are finite state machines.

Regular expression engines are implemented as finite state machines (FSM). The pattern you supply is compiled into a data structure that represents this state machine.

When you match a string against this pattern, the regex engine takes each character and decides the state transition within the FSM. If there are no valid state transitions for an input character the match fails.

One of the states in the FSM is a terminating/end state. If the regex engine gets there it reports success.

To answer your "how is that more efficient than a string" question, it can't be a string... you have to get the low-level instructions for the engine. A String type isn't a set of instructions!

Upvotes: 9

Related Questions