Unescaped java not matching in regex matcher.find()

Question

I have the following code that basically matches "Match this:" and keeps the first sentence. However, Unicode characters sometimes get passed into the text, and they cause backtracking problems in other, more complicated regexes. Escaping seems to alleviate the backtracking index-out-of-range exceptions. However, now the regex isn't matching.

What I would like to know is why this regex isn't matching when escaped. If you comment out the escape/unescape Java lines, everything works.

    String text = "Keep this\n\n"
            + "Match this:\n\nDelete 📱 this";
    text = org.apache.commons.lang.StringEscapeUtils.escapeJava(text);
    Pattern PATTERN = Pattern.compile("^Match this:$",
            Pattern.MULTILINE);
    Matcher m = PATTERN.matcher(text);
    if (m.find()) {
        text = text.substring(0, m.start()).replaceAll("[\n]+$", "");
    }
    text = org.apache.commons.lang.StringEscapeUtils.unescapeJava(text);
    System.out.println(text);

Pshemo · Accepted Answer

What I would like to know is why this regex isn't matching when escaped?

When you escape a string like "foo\nbar", which printed looks like

foo
bar

you get "foo\\nbar", which printed looks like

foo\nbar

This happens because StringEscapeUtils.escapeJava also escapes the line feed "\n", replacing it with the two-character sequence "\\n". It is then no longer a line separator but a literal backslash followed by n, so ^ and $ have no line boundary to match at.
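
A minimal sketch of the effect, using only java.util.regex. The escaped string is written out by hand here rather than produced by StringEscapeUtils, and the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class EscapedNewlineDemo {
    // Same pattern as in the question: ^ and $ match at line boundaries.
    static final Pattern PATTERN = Pattern.compile("^Match this:$", Pattern.MULTILINE);

    static boolean matches(String text) {
        return PATTERN.matcher(text).find();
    }

    public static void main(String[] args) {
        // Real line feeds: "Match this:" sits on a line of its own.
        String raw = "Keep this\n\nMatch this:\n\nDelete this";
        // What the text looks like after escaping: backslash + 'n', one single line.
        String escaped = "Keep this\\n\\nMatch this:\\n\\nDelete this";

        System.out.println(matches(raw));     // true
        System.out.println(matches(escaped)); // false
    }
}
```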

A possible solution is to replace "\\n" back with "\n" after StringEscapeUtils.escapeJava. You need to be careful here not to "unescape" a real "\\n" (a literal backslash followed by n in the original text), which escapeJava turns into "\\\\n" and which printed looks like \\n. So maybe use

text = org.apache.commons.lang3.StringEscapeUtils.escapeJava(text);
text = text.replaceAll("(?
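
As a hedged alternative to a lookbehind-based replacement, here is a sketch that walks the escaped text one escape sequence at a time and turns only \n and \r back into real line separators. It is not the code from this answer: the class name LineSeparatorRestorer and the method restore are hypothetical, and it assumes the input was produced by an escaper such as escapeJava, so every backslash starts an escape sequence.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineSeparatorRestorer {
    // A backslash plus the character after it, i.e. the start of one escape sequence.
    private static final Pattern ESCAPE = Pattern.compile("\\\\(.)", Pattern.DOTALL);

    static String restore(String escaped) {
        Matcher m = ESCAPE.matcher(escaped);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement;
            switch (m.group(1)) {
                case "n": replacement = "\n"; break;   // escaped line feed
                case "r": replacement = "\r"; break;   // escaped carriage return
                default:  replacement = m.group();     // leave every other escape untouched
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String escaped = "Keep this\\n\\nMatch this:\\n\\nDelete \\uD83D\\uDCF1 this";
        String restored = restore(escaped);
        Pattern p = Pattern.compile("^Match this:$", Pattern.MULTILINE);
        System.out.println(p.matcher(restored).find()); // true
    }
}
```

Because matches are consumed left to right in pairs, an escaped backslash followed by n (four characters in a Java literal: "\\\\n") is kept as it is, while a lone "\\n" becomes a line feed.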





Another option is to create your own implementation similar to StringEscapeUtils.escapeJava. If you look at the body of that method, you will see

return ESCAPE_JAVA.translate(input);


Where ESCAPE_JAVA is

CharSequenceTranslator ESCAPE_JAVA = 
  new LookupTranslator(
    new String[][] { 
      {"\"", "\\\""},
      {"\\", "\\\\"},
  }).with(
    new LookupTranslator(EntityArrays.JAVA_CTRL_CHARS_ESCAPE())
  ).with(
    UnicodeEscaper.outsideOf(32, 0x7f) 
);


and EntityArrays.JAVA_CTRL_CHARS_ESCAPE() returns a clone of the

String[][] JAVA_CTRL_CHARS_ESCAPE = {
    {"\b", "\\b"},
    {"\n", "\\n"},
    {"\t", "\\t"},
    {"\f", "\\f"},
    {"\r", "\\r"}
};


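array. So if you provide your own table here, one that explicitly says that \n should be left as it is (that is, replaced with itself, \n), your escaper will leave it alone. Simply removing the entry would not be enough, because UnicodeEscaper.outsideOf(32, 0x7f) would then escape the character as \u000A.

Your own implementation could then look like this: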

// Needs CharSequenceTranslator, LookupTranslator and UnicodeEscaper
// from org.apache.commons.lang3.text.translate
private static CharSequenceTranslator translatorIgnoringLineSeparators = 
    new LookupTranslator(
        new String[][] { 
                { "\"", "\\\"" }, 
                { "\\", "\\\\" }, 
        }).with(
                new LookupTranslator(new String[][] {
                        { "\b", "\\b" },
                        { "\n", "\n"  },//this will handle `\n` and will not change it
                        { "\r", "\r"  },//this will handle `\r` and will not change it
                        { "\t", "\\t" }, 
                        { "\f", "\\f" },
        })).with(UnicodeEscaper.outsideOf(32, 0x7f));

public static String myJavaEscaper(CharSequence input) {
    return translatorIgnoringLineSeparators.translate(input);
}


This method will prevent escaping \n and \r.
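
To see the idea end to end without the Commons Lang dependency, here is a rough stdlib-only sketch. The method escapeKeepingLineSeparators is a hypothetical stand-in for the custom translator above, not the library's implementation: it escapes quotes, backslashes, control characters and anything outside 32..0x7f per UTF-16 char, but passes \n and \r through unchanged. The class and method names are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeepLineSeparatorsDemo {
    // Hypothetical stand-in for the custom translator: escapes like escapeJava,
    // except that line separators stay real line separators.
    static String escapeKeepingLineSeparators(CharSequence input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\b': out.append("\\b"); break;
                case '\t': out.append("\\t"); break;
                case '\f': out.append("\\f"); break;
                case '\n':
                case '\r': out.append(c); break; // left untouched on purpose
                default:
                    if (c < 32 || c > 0x7f) {
                        out.append(String.format("\\u%04X", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    // Same logic as in the question, applied to the escaped text.
    static String keepTextBeforeMarker(String text) {
        String escaped = escapeKeepingLineSeparators(text);
        Matcher m = Pattern.compile("^Match this:$", Pattern.MULTILINE).matcher(escaped);
        if (m.find()) {
            return escaped.substring(0, m.start()).replaceAll("[\n]+$", "");
        }
        return escaped;
    }

    public static void main(String[] args) {
        // \uD83D\uDCF1 is the phone emoji from the question, written as a surrogate pair.
        String text = "Keep this\n\nMatch this:\n\nDelete \uD83D\uDCF1 this";
        System.out.println(keepTextBeforeMarker(text)); // Keep this
    }
}
```

The result is still in escaped form, so in the original flow it would be passed to unescapeJava afterwards.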
