peter.murray.rust

Reputation: 38073

generating a regular expression from a string

I wish to generate a regular expression from a string containing numbers, and then use this as a Pattern to search for similar strings. Example:

String s = "Page 3 of 23";

If I replace every digit with \d:

    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (Character.isDigit(c)) {
            sb.append("\\d"); // backslash d
        } else {
            sb.append(c);
        }
    }

    Pattern numberPattern = Pattern.compile(sb.toString());

    // equivalent to: Pattern numberPattern = Pattern.compile("Page \\d of \\d\\d");

I can use this to match similar strings (e.g. "Page 7 of 47"). My problem is that if I do this naively, metacharacters in the original string such as (){}-, etc. will not be escaped, so the generated pattern may mean something different from the literal text. Is there an exhaustive list of the characters that I must and must not escape in a regular expression? (I could try to extract them from the Javadocs, but I am worried about missing something.)

Alternatively, is there a library which already does this? (I don't at this stage want to use a full Natural Language Processing solution.)
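To make the concern concrete, here is a minimal sketch of the naive substitution applied to a string that contains parentheses. The class name NaiveDigitPattern, the method name naiveRegex, and the sample strings are made up for illustration; the loop itself is the one shown above.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
final class NaiveDigitPattern {

    // Same character-by-character substitution as in the question:
    // digits become \d, everything else is copied through unescaped.
    static String naiveRegex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isDigit(c)) {
                sb.append("\\d");
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String regex = naiveRegex("Page 3 (of 23)");
        System.out.println(regex);
        // The unescaped parentheses are read as a capturing group,
        // so a string with literal parentheses does not match.
        System.out.println(Pattern.matches(regex, "Page 7 (of 47)"));
        // The same pattern does match the text without parentheses.
        System.out.println(Pattern.matches(regex, "Page 7 of 47"));
    }
}
```

The pattern compiles without complaint but matches the wrong strings, which is why a silent omission in a hand-written escape list is hard to notice.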

NOTE: @dasblinkenlight's edited answer now works for me!

Upvotes: 10

Views: 1202

Answers (1)

Sergey Kalinichenko

Reputation: 726987

Java's regexp library provides this functionality:

String s = Pattern.quote(orig);

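The "quoted" string has all of its metacharacters neutralized: Pattern.quote wraps the text in \Q and \E, and the regex engine treats everything between those two markers as a literal. First quote your string, then go through it and replace the digits with \d to make a regular expression. Because the quoting relies on \Q and \E, each \d you insert has to sit outside the quoted region: close the quote with \E just before it and reopen it with \Q just after it.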

One thing I would change in your implementation is the replacement algorithm: rather than replacing character by character, I would replace each run of consecutive digits as a unit. This would let an expression produced from "Page 3 of 23" match strings like "Page 13 of 23" and "Page 6 of 8".

String p = Pattern.quote(orig).replaceAll("\\d+", "\\\\E\\\\d+\\\\Q");

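This would produce "\QPage \E\d+\Q of \E\d+\Q\E" no matter what page numbers and counts were there originally. The resulting pattern contains one backslash in \d, not two. The doubled backslashes are needed only in Java source code, and again in the replacement argument of replaceAll, where a backslash must itself be escaped. The string produced at run time is fed directly to the regex engine, bypassing the Java compiler.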
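For completeness, here is a minimal runnable sketch of the whole approach. The class name DigitPattern, the method name fromExample, and the sample inputs are made up for illustration; only Pattern.quote, String.replaceAll, and Pattern.compile come from the standard library.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
final class DigitPattern {

    // Quote the example so that it is matched literally, then replace each
    // run of digits with \d+. The \E ... \Q pair steps out of the quoted
    // region around \d+ and back into it afterwards.
    static Pattern fromExample(String example) {
        String regex = Pattern.quote(example)
                .replaceAll("\\d+", "\\\\E\\\\d+\\\\Q");
        return Pattern.compile(regex);
    }

    public static void main(String[] args) {
        Pattern p = fromExample("Page 3 of 23");
        System.out.println(p.pattern());
        System.out.println(p.matcher("Page 13 of 240").matches());
        System.out.println(p.matcher("Page x of 23").matches());

        // Metacharacters in the example stay literal.
        Pattern q = fromExample("Fig. 2 (a-b) {10}");
        System.out.println(q.matcher("Fig. 7 (a-b) {3}").matches());
        System.out.println(q.matcher("Figx 7 (a-b) {3}").matches());
    }
}
```

Note that matches() requires the whole input to match; to search for the pattern inside a longer text, Matcher.find() would be the call to use instead.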

Upvotes: 10
