Elist

Reputation: 5533

Escape each literal in a regex string instead of quoting the entire string

The answers here suggest using Pattern.quote to escape the special regex characters.

The problem with Pattern.quote is that it escapes the string as a whole, not each special character on its own.

This is my case:
I receive a string from the user and need to search for it in a document. Since the user can't pass newline characters (a bug in a third-party API I have no access to), I decided to treat any whitespace sequence as "\s+" and use a regex to search the document. This way the user can send a plain space instead of a newline character.

For instance, if the document is:

The \s metacharacter is used to find a whitespace character.

A whitespace character can be:

  • A space character
  • A tab character
  • A carriage return character
  • A new line character
  • A vertical tab character
  • A form feed character

    Then the received string

    String receivedStr = "The \\s metacharacter is used to find a whitespace character. A whitespace character can be:";
    

    should be found in the document.

    To achieve this, I want to quote the string and then replace any whitespace sequence with the string "\s+".
    Using the following code:

    receivedStr = Pattern.quote(receivedStr).replaceAll("\\s+", "\\\\s+");
    

    yields the regex:

    \QThe\s+\s\s+metacharacter\s+is\s+used\s+to\s+find\s+a\s+whitespace\s+character.\s+A\s+whitespace\s+character\s+can\s+be:\E

    which will of course treat my added "\s+" sequences as literal text (they sit between \Q and \E), instead of the expected:

    The\s+\\s\s+metacharacter\s+is\s+used\s+to\s+find\s+a\s+whitespace\s+character.\s+A\s+whitespace\s+character\s+can\s+be:

    that only escapes the "\s" literal and not the entire string.
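
The failure can be reproduced with a short sketch. The class name QuoteWholeStringDemo, the helper quoteThenReplace, and the shortened sample document are illustrative assumptions, not part of the original code; only Pattern.quote and replaceAll are taken from the snippet above.

```java
import java.util.regex.Pattern;

class QuoteWholeStringDemo {
    // Quote the whole string first, then swap whitespace runs for \s+
    // (the approach described in the question).
    static String quoteThenReplace(String received) {
        return Pattern.quote(received).replaceAll("\\s+", "\\\\s+");
    }

    public static void main(String[] args) {
        // Shortened, hypothetical document: the newline stands in for
        // the line break the user cannot send.
        String document = "A whitespace character\ncan be:";
        String received = "A whitespace character can be:";

        String regex = quoteThenReplace(received);
        // Everything between \Q and \E is literal, including the inserted \s+.
        System.out.println(regex);
        // Prints false: the pattern looks for the literal text "\s+" in the document.
        System.out.println(Pattern.compile(regex).matcher(document).find());
    }
}
```

Because the inserted "\s+" ends up inside the \Q...\E region, the pattern only matches a document that literally contains a backslash, an "s", and a plus sign between the words.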

    Is there an alternative to Pattern.quote that escapes single literals instead of the whole string?

    Upvotes: 1

    Views: 342

    Answers (1)

    aioobe

    Reputation: 421360

    I would suggest something like this:

    String re = Stream.of(input.split("\\s+"))
                      .map(Pattern::quote)
                      .collect(Collectors.joining("\\s+"));
    

    This makes sure everything gets quoted (including stuff that would otherwise be interpreted as look-arounds and could cause exponential blowup in match finding), and any user-entered whitespace ends up as an unquoted \s+.
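
As a rough end-to-end sketch of the snippet above, applied to the question's scenario: the class name WhitespaceTolerantSearch, the helper toPattern, and the call to trim() are assumptions added for illustration. The trim() is there because split("\\s+") yields an empty first token when the input starts with whitespace, which would force the match to begin with whitespace.

```java
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class WhitespaceTolerantSearch {
    // Quote each whitespace-separated token and join the tokens with an unquoted \s+.
    static Pattern toPattern(String input) {
        // trim() is an addition in this sketch: it drops leading and trailing
        // whitespace so that no empty token is produced at the start.
        String re = Stream.of(input.trim().split("\\s+"))
                          .map(Pattern::quote)
                          .collect(Collectors.joining("\\s+"));
        return Pattern.compile(re);
    }

    public static void main(String[] args) {
        // The document contains a literal backslash-s and real line breaks.
        String document = "The \\s metacharacter is used to find a whitespace character.\n\n"
                        + "A whitespace character can be:\n";
        String receivedStr = "The \\s metacharacter is used to find a whitespace character. "
                           + "A whitespace character can be:";

        Pattern p = toPattern(receivedStr);
        System.out.println(p.pattern());
        // Prints true: the single space after "character." matches the blank line.
        System.out.println(p.matcher(document).find());
    }
}
```

Quoting per token keeps every user-supplied character literal, while the separators stay outside the \Q...\E regions and therefore keep their regex meaning.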

    Example input (written as in a Java string literal, so \\b stands for a literal \b):

    Lorem \\b ipsum \\s dolor (sit) amet.
    

    Output:

    \QLorem\E\s+\Q\b\E\s+\Qipsum\E\s+\Q\s\E\s+\Qdolor\E\s+\Q(sit)\E\s+\Qamet.\E
    

    Upvotes: 2
