Regex expression detect <code>...</code> code chunks

Question

I'm trying to detect <code>...</code> chunks inside an HTML source file in order to remove them from the file. I am using the Java 8 Pattern and Matcher classes to apply the regex. For example, this method prints out every <code>...</code> match it finds.

protected void printSourceCodeChunks() {
  // Design a regular expression to detect code chunks
  String patternString = "<code>.*</code>";
  Pattern pattern = Pattern.compile(patternString);
  Matcher matcher = pattern.matcher(source);
  
  // Loop over findings
  int i = 1;
  while (matcher.find())
    System.out.println(i++ + ": " + matcher.group());
}


A typical output would be:
1:  
2: 
3: <code>System.out.println("Hello World");</code>

As I am using the special character dot and the source code chunks may include line breaks (\n or \r\n), code blocks that contain line breaks are not detected. Fortunately, the Pattern class can be instructed to let the dot match line breaks as well, just by adding the DOTALL flag:
  Pattern pattern = Pattern.compile(patternString, Pattern.DOTALL);
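
To make the effect of the flag concrete, here is a minimal sketch. The class name, the findAll helper and the sample HTML are made up for illustration and are not part of the original code; only Pattern, Matcher and Pattern.DOTALL come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {
    // Hypothetical helper: returns every match of the regex in the input,
    // compiled with the given Pattern flags (0 means no flags).
    static List<String> findAll(String regex, int flags, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex, flags).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Assumed sample input: a single chunk that spans two lines.
        String html = "<p>intro</p><code>int a = 1;\nint b = 2;</code>";
        String regex = "<code>.*</code>";

        // Without DOTALL the dot does not match '\n', so the chunk is missed.
        System.out.println(findAll(regex, 0, html).size());              // prints 0
        // With DOTALL the dot also matches '\n', so the chunk is found.
        System.out.println(findAll(regex, Pattern.DOTALL, html).size()); // prints 1
    }
}
```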


The problem with this approach is that only one (fake) <code>...</code> block is detected: the one starting at the first occurrence of <code> and ending at the last occurrence of </code> in the HTML file. The output now includes all the HTML code between these two tags.
How can I alter the regex to match every single code block?
Solution proposal
As many of you posted, and for the benefit of future readers, it was as easy as changing my regex to
<code>.*?</code>

as * is greedy and takes all characters up to the last </code> it finds.
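
Since the original goal was to remove the chunks from the file, a possible sketch of that step follows. The class name, the removeCodeChunks method and the sample HTML are assumptions for illustration; the sketch only handles plain <code> tags without attributes and does not handle nested tags.

```java
import java.util.regex.Pattern;

class CodeChunkRemover {
    // Lazy quantifier plus DOTALL: each match ends at the nearest </code>,
    // and chunks containing line breaks are matched too.
    private static final Pattern CODE_CHUNK =
            Pattern.compile("<code>.*?</code>", Pattern.DOTALL);

    // Hypothetical helper: returns the HTML with every <code>...</code> chunk deleted.
    static String removeCodeChunks(String html) {
        return CODE_CHUNK.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Assumed sample input with two chunks, the second spanning two lines.
        String html = "<p>a</p><code>x = 1;</code><p>b</p><code>y =\n2;</code><p>c</p>";
        System.out.println(removeCodeChunks(html)); // prints <p>a</p><p>b</p><p>c</p>
    }
}
```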

Lino · Accepted Answer

You can do that with the non-greedy (lazy) modifier ?:

String patternString = "<code>.*?</code>";



By default, * is greedy: it matches as much as it can, from the first occurrence of <code> to the last occurrence of </code>. With the question mark ?, it stops matching at the first </code> it reaches.
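
A small side-by-side sketch of the two behaviours, under the assumption of a made-up input string; the class name and the findAll helper are hypothetical and only there for the demonstration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Hypothetical helper: collects every match, with DOTALL enabled
    // so that the dot also matches line breaks.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Assumed sample input with two separate chunks.
        String html = "<code>a();</code><p>text</p><code>b();</code>";

        // Greedy: one match running from the first <code> to the last </code>.
        System.out.println(findAll("<code>.*</code>", html).size());  // prints 1
        // Lazy: one match per chunk.
        System.out.println(findAll("<code>.*?</code>", html).size()); // prints 2
    }
}
```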

That said, I highly recommend not "parsing" any structure with regex; it is better to use a dedicated HTML parser.
