coterobarros

Reputation: 1287

Regex to detect <code>...</code> code chunks

I'm trying to detect <code>...</code> chunks inside an HTML source file in order to remove them from the file. I am using the Java 8 Pattern and Matcher classes for the regex. For example, this method prints every <code>...</code> match it finds.

protected void printSourceCodeChunks() {
  // Design a regular expression to detect code chunks
  String patternString = "<code>.*<\\/code>";
  Pattern pattern = Pattern.compile(patternString);
  Matcher matcher = pattern.matcher(source);
  
  // Loop over findings
  int i = 1;
  while (matcher.find())
    System.out.println(i++ + ": " + matcher.group());
}

A typical output would be:

1: <code> </code>
2: <code></code>
3: <code>System.out.println("Hello World");</code>

Because I am using the dot metacharacter and the source code chunks may include line breaks (\n or \r), code blocks that contain line breaks are not detected. Fortunately, the Pattern class can be told to let the dot match line breaks as well, by passing the DOTALL flag:

  Pattern pattern = Pattern.compile(patternString, Pattern.DOTALL);

The problem with this approach is that only one (fake) <code>...</code> block is detected: the one that starts at the first occurrence of <code> and ends at the last occurrence of </code> in the HTML file. The output now includes all the HTML between these two tags.

How can I alter the regex to match every single code block?

Solution proposal

As many of you posted, and for the benefit of future readers, it was as easy as changing my regex to

<code>.*?<\\/code>

because a greedy * consumes every character up to the last </code> it finds, whereas the lazy *? stops at the first one.
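
For future readers, here is a minimal sketch of how the lazy pattern could be combined with Pattern.DOTALL to list and then remove every chunk, including those that span several lines. The class name CodeChunkDemo, the helper names findCodeChunks and removeCodeChunks, and the sample input are made up for illustration; the forward slash is left unescaped because Java regex does not require escaping it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original code.
class CodeChunkDemo {
    // Lazy quantifier (.*?) plus DOTALL so that '.' also matches line breaks.
    private static final Pattern CODE_CHUNK =
            Pattern.compile("<code>.*?</code>", Pattern.DOTALL);

    // Collects every <code>...</code> chunk found in the given source.
    static List<String> findCodeChunks(String source) {
        List<String> chunks = new ArrayList<>();
        Matcher matcher = CODE_CHUNK.matcher(source);
        while (matcher.find()) {
            chunks.add(matcher.group());
        }
        return chunks;
    }

    // Returns the source with every matched chunk removed.
    static String removeCodeChunks(String source) {
        return CODE_CHUNK.matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        // Sample input (an assumption): one multi-line chunk and one empty chunk.
        String source = "<p>a</p><code>int x = 1;\nint y = 2;</code><p>b</p><code></code>";
        List<String> chunks = findCodeChunks(source);
        for (int i = 0; i < chunks.size(); i++) {
            System.out.println((i + 1) + ": " + chunks.get(i));
        }
        System.out.println(removeCodeChunks(source));
    }
}
```

With this sample input, two chunks are reported and the remaining text is <p>a</p><p>b</p>.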

Upvotes: 0

Views: 103

Answers (2)

baao

Reputation: 73251

You don't use regex to manipulate HTML!

Instead, parse the HTML, for example with jsoup, and remove the elements properly.

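import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Sample document containing three <code> elements
String html = "<html><head><title>First parse</title></head>"
        + "<body><p>Parsed HTML into a doc.</p><code>foo</code><code></code><code> </code></body></html>";
Document doc = Jsoup.parse(html);
// Select every <code> element in the body and remove it from the DOM
Elements codes = doc.body().getElementsByTag("code");
codes.remove();
System.out.println(doc.toString());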

Upvotes: 4

Lino

Reputation: 19926

You can do that by making the quantifier non-greedy (lazy) with a trailing ?:

String patternString = "<code>.*?<\\/code>";

By default, * is greedy: it matches as much as it can, so the match runs from the first occurrence of <code> to the last occurrence of </code>. With the question mark appended (*?), it becomes lazy and stops at the first </code> it encounters.
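
To make the difference concrete, here is a small sketch that counts matches for both variants on the same input. The class name GreedyVsLazy, the helper countMatches, and the sample string are assumptions for illustration only; DOTALL is enabled as in the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the greedy and lazy quantifiers.
class GreedyVsLazy {
    // Counts how many times the regex matches in the input (DOTALL enabled).
    static int countMatches(String regex, String input) {
        Matcher matcher = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "<code>a</code> text <code>b</code>";
        // Greedy: a single match spanning from the first <code> to the last </code>.
        System.out.println("greedy: " + countMatches("<code>.*</code>", input));
        // Lazy: one match per <code>...</code> pair.
        System.out.println("lazy:   " + countMatches("<code>.*?</code>", input));
    }
}
```

On this input the greedy pattern yields one match and the lazy pattern yields two.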

That said, I highly recommend not "parsing" any structured markup with regex; it is better to use a dedicated HTML parser.
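
As a rough illustration of why a parser tends to be safer, the sketch below shows the same lazy pattern missing tags that carry attributes or use different letter case. The class name RegexLimits, the helper detects, and the sample inputs are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class showing inputs the literal <code> pattern misses.
class RegexLimits {
    // The lazy pattern from this answer, compiled with DOTALL so '.' spans line breaks.
    static final Pattern CODE_CHUNK =
            Pattern.compile("<code>.*?</code>", Pattern.DOTALL);

    // True if the pattern finds at least one chunk in the given HTML.
    static boolean detects(String html) {
        return CODE_CHUNK.matcher(html).find();
    }

    public static void main(String[] args) {
        System.out.println(detects("<code>x</code>"));                // plain tag
        System.out.println(detects("<code class=\"java\">x</code>")); // tag with an attribute
        System.out.println(detects("<CODE>x</CODE>"));                // upper-case tag
    }
}
```

Only the plain tag is detected here; the other two are valid HTML that an HTML parser would handle but this literal pattern does not.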

Upvotes: 2
