c12

Reputation: 9827

Regular expression breaks on line break character (\n)

I have a regex that uses the pattern "'''.*?'''|'.*?'" to find text between triple quotes (''') and single quotes ('). When the input String contains line breaks, the pattern fails to match through to the closing triple quote. How can I change the regex so that it reads to the closing triple quote and does not stop at the \n? In the failing case below, quoteMatcher.end() returns 2, so only the leading '' is matched and the output ends up as ''''''.

Works:

'''<html><head></head></html>'''

Fails:

User Entered Value:

   '''<html>
    <head></head>
    </html>'''

Java Representation:

'''<html>\n<head></head>\n</html>'''

Parsing Logic:

public static final Pattern QUOTE_PATTERN = Pattern.compile("'''.*?'''|'.*?'");

Matcher quoteMatcher = QUOTE_PATTERN.matcher(value);
int normalPos = 0, length = value.length();
while (normalPos < length && quoteMatcher.find()) {
  int quotePos = quoteMatcher.start(), quoteEnd = quoteMatcher.end();
  if (normalPos < quotePos) {
    copyBuilder.append(stripHTML(value.substring(normalPos, quotePos)));
  }
  // quoteEnd fails to read to the end due to \n
  copyBuilder.append(value.substring(quotePos, quoteEnd));
  normalPos = quoteEnd;
}
if (normalPos < length) copyBuilder.append(stripHTML(value.substring(normalPos)));

Upvotes: 0

Views: 125

Answers (1)

Pierluc SS

Reputation: 3176

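Use the Pattern.DOTALL flag so that . also matches line terminators. By default . does not match \n, so the lazy .*? can never reach the closing ''' and the second alternative matches the first two quotes instead.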

public static final Pattern QUOTE_PATTERN = Pattern.compile("'''.*?'''|'.*?'", Pattern.DOTALL);
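
To see the difference, here is a minimal sketch that runs both patterns against the failing input from the question. The class name QuoteDotallDemo and the helper firstMatch are made up for illustration and are not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
class QuoteDotallDemo {
    // Default mode: '.' does not match line terminators such as \n.
    static final Pattern WITHOUT_DOTALL = Pattern.compile("'''.*?'''|'.*?'");

    // DOTALL mode: '.' matches any character, including \n.
    static final Pattern WITH_DOTALL = Pattern.compile("'''.*?'''|'.*?'", Pattern.DOTALL);

    // Returns the text of the first match, or null if nothing matches.
    static String firstMatch(Pattern pattern, String value) {
        Matcher matcher = pattern.matcher(value);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String value = "'''<html>\n<head></head>\n</html>'''";

        // The triple-quote alternative fails at the first \n, so the
        // single-quote alternative matches the first two quotes (end() == 2).
        System.out.println(firstMatch(WITHOUT_DOTALL, value)); // prints ''

        // With DOTALL the triple-quote alternative spans the line breaks
        // and the whole input is matched.
        System.out.println(firstMatch(WITH_DOTALL, value));
    }
}
```

If you would rather keep the flag inside the pattern string, the embedded flag (?s) at the start of the regex enables the same mode.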

Upvotes: 3
