genxgeek

Reputation: 13367

Unescaped java not matching in regex matcher.find()

I have the following code that matches "Match this:" and keeps only the text before it. However, Unicode characters sometimes get passed into the text, and they cause backtracking problems in other, more complicated regexes. Escaping the text seems to alleviate the backtracking index-out-of-range exceptions. However, now the regex isn't matching.

What I would like to know is why this regex isn't matching when escaped? If you comment out the escape/unescape Java lines, everything works.

    String text = "Keep this\n\n"
            + "Match this:\n\nDelete 📱 this";
    text = org.apache.commons.lang.StringEscapeUtils.escapeJava(text);
    Pattern PATTERN = Pattern.compile("^Match this:$",
            Pattern.MULTILINE);
    Matcher m = PATTERN.matcher(text);
    if (m.find()) {
        text = text.substring(0, m.start()).replaceAll("[\\n]+$", "");
    }
    text = org.apache.commons.lang.StringEscapeUtils.unescapeJava(text);
    System.out.println(text);

Upvotes: 0

Views: 225

Answers (1)

Pshemo

Reputation: 124275

What I would like to know is why this regex isn't matching when escaped?

When you escape a string such as "foo\nbar", which prints as

foo
bar

you get "foo\\nbar", which prints as

foo\nbar

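This happens because StringEscapeUtils.escapeJava also escapes \n, replacing it with \\n. The result is no longer a line separator but the two literal characters \ and n, so the text becomes a single line and ^ and $ (even with Pattern.MULTILINE) can only match at its very start and end.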
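The effect can be reproduced without Commons Lang. The sketch below is only an illustration: `escapeNewlines` is a hypothetical stand-in that mimics just the newline handling of `escapeJava` (line feed replaced by backslash plus `n`), and the class and method names are made up for the demo.

```java
import java.util.regex.Pattern;

class EscapedNewlineDemo {
    // Same pattern as in the question: with MULTILINE, ^ and $ match at line boundaries.
    static final Pattern PATTERN = Pattern.compile("^Match this:$", Pattern.MULTILINE);

    // Hypothetical stand-in for escapeJava, limited to its newline handling:
    // a real line feed becomes the two characters backslash and 'n'.
    static String escapeNewlines(String s) {
        return s.replace("\n", "\\n");
    }

    static boolean matches(String s) {
        return PATTERN.matcher(s).find();
    }

    public static void main(String[] args) {
        String text = "Keep this\n\nMatch this:\n\nDelete this";
        System.out.println(matches(text));                 // true: three real lines
        System.out.println(matches(escapeNewlines(text))); // false: one long line
    }
}
```

After escaping, the input contains no line terminators at all, so the only positions where `^` and `$` can match are the start and end of the whole string.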

A possible solution is to replace "\\n" with "\n" again after calling StringEscapeUtils.escapeJava. Be careful not to "unescape" a real "\\n" (a literal backslash followed by n in the original text): escapeJava turns it into "\\\\n", which prints as \\n, and that must stay untouched. So maybe use

text = org.apache.commons.lang3.StringEscapeUtils.escapeJava(text);
text = text.replaceAll("(?<!\\\\)\\\\n", "\n");// unescape `\n` (turn it back into a line feed)
                                               // only if it is not preceded by `\`
//do your job

//and now you can unescape your text (\n will stay \n)
text = org.apache.commons.lang3.StringEscapeUtils.unescapeJava(text);
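To see what the lookbehind does, here is a small self-contained sketch that applies the same regex to hand-written escaped strings. The input strings are assumptions standing in for what escapeJava would produce, and the class and method names are hypothetical.

```java
class RestoreNewlinesDemo {
    // Turns the two characters backslash + 'n' back into a real line feed,
    // unless they are directly preceded by another backslash.
    static String restoreNewlines(String escaped) {
        return escaped.replaceAll("(?<!\\\\)\\\\n", "\n");
    }

    public static void main(String[] args) {
        // Escaped form of: a, line feed, b
        String escapedLineFeed = "a\\nb";
        // Escaped form of the literal text: a, backslash, n, b
        String escapedLiteral = "a\\\\nb";
        // Escaped form of: backslash, line feed
        String escapedBackslashThenLineFeed = "\\\\\\n";

        System.out.println(restoreNewlines(escapedLineFeed).equals("a\nb"));         // true
        System.out.println(restoreNewlines(escapedLiteral).equals(escapedLiteral));  // true
        System.out.println(restoreNewlines(escapedBackslashThenLineFeed)
                .equals(escapedBackslashThenLineFeed));                              // true
    }
}
```

The third case is a limitation worth knowing about: when the original text has a backslash immediately followed by a line feed, the escaped line feed is preceded by a backslash, so the lookbehind leaves it escaped. If such input is possible, a custom escaper that never touches \n is likely the safer route.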

Another option could be creating your own implementation similar to StringEscapeUtils.escapeJava. If you take a look at this method body you will see

return ESCAPE_JAVA.translate(input);

Where ESCAPE_JAVA is

CharSequenceTranslator ESCAPE_JAVA = 
  new LookupTranslator(
    new String[][] { 
      {"\"", "\\\""},
      {"\\", "\\\\"},
  }).with(
    new LookupTranslator(EntityArrays.JAVA_CTRL_CHARS_ESCAPE())
  ).with(
    UnicodeEscaper.outsideOf(32, 0x7f) 
);

and EntityArrays.JAVA_CTRL_CHARS_ESCAPE() returns clone of

String[][] JAVA_CTRL_CHARS_ESCAPE = {
    {"\b", "\\b"},
    {"\n", "\\n"},
    {"\t", "\\t"},
    {"\f", "\\f"},
    {"\r", "\\r"}
};

array. So if you supply your own table that explicitly maps \n to itself (it is "replaced" with \n), the translator will leave it unchanged.

This is how your own implementation could look:

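// imports assume Commons Lang 3 (the translate package)
import org.apache.commons.lang3.text.translate.CharSequenceTranslator;
import org.apache.commons.lang3.text.translate.LookupTranslator;
import org.apache.commons.lang3.text.translate.UnicodeEscaper;

private static CharSequenceTranslator translatorIgnoringLineSeparators = 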
    new LookupTranslator(
        new String[][] { 
                { "\"", "\\\"" }, 
                { "\\", "\\\\" }, 
        }).with(
                new LookupTranslator(new String[][] {
                        { "\b", "\\b" },
                        { "\n", "\n"  },//this will handle `\n` and will not change it
                        { "\r", "\r"  },//this will handle `\r` and will not change it
                        { "\t", "\\t" }, 
                        { "\f", "\\f" },
        })).with(UnicodeEscaper.outsideOf(32, 0x7f));

public static String myJavaEscaper(CharSequence input) {
    return translatorIgnoringLineSeparators.translate(input);
}

This method escapes everything else as before but leaves \r and \n untouched.
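If you want to try the idea without the library on the classpath, the following is a rough standard-library approximation, not the Commons Lang implementation. `LineFriendlyEscaper`, `escape` and `keepBeforeMatch` are hypothetical names, and the escaper writes each UTF-16 code unit outside printable ASCII as a backslash-u sequence, which may differ from what UnicodeEscaper emits for characters beyond the Basic Multilingual Plane.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineFriendlyEscaper {
    // Approximation of the custom translator above: escapes quotes, backslashes,
    // some control characters and non-ASCII code units, but copies line
    // separators through unchanged.
    static String escape(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\b': out.append("\\b"); break;
                case '\t': out.append("\\t"); break;
                case '\f': out.append("\\f"); break;
                case '\n':
                case '\r': out.append(c); break; // keep line separators as they are
                default:
                    if (c < 32 || c > 0x7f) {
                        out.append(String.format("\\u%04X", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    // The logic from the question: keep only the text before "Match this:".
    static String keepBeforeMatch(String text) {
        Matcher m = Pattern.compile("^Match this:$", Pattern.MULTILINE).matcher(text);
        return m.find() ? text.substring(0, m.start()).replaceAll("[\\n]+$", "") : text;
    }

    public static void main(String[] args) {
        // The phone emoji from the question, written as a surrogate pair.
        String text = "Keep this\n\nMatch this:\n\nDelete \uD83D\uDCF1 this";
        System.out.println(keepBeforeMatch(escape(text))); // Keep this
    }
}
```

Because the line feeds survive escaping, the MULTILINE pattern from the question still finds its line, while the emoji is reduced to plain ASCII before any regex sees it.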

Upvotes: 3
