Webucator

Reputation: 2683

Need to find HTML pre tags that contain other tags

I have HTML content with <pre> tags that contain other tags. All the angle brackets in the <pre> content should be escaped using HTML entities. In other words, every < should become &lt; and every > should become &gt;.

For starters, I just want to be able to figure out which files have the offending content. Can anyone think of a way to do this using a regular expression?

BAD: RegEx should match this

<body>
    <h1>My Content</h1>
    <pre class="some-class">
        <foo>
            <bar>Content</bar>
            <script>
                alert('Hi!');
            </script>
        </foo>
        <br>
    </pre>

    <p>The middle</p>

    <pre class="other-class">
        <bar>
            <foo>Text</foo>
            <script>
                alert('Bye!');
            </script>
        </bar>
        <br>
    </pre>
    <p>The end</p>
</body>

GOOD: RegEx should not match this.

<body>
    <h1>My Content</h1>
    <pre class="some-class">
        &lt;foo&gt;
            &lt;bar&gt;Content&lt;/bar&gt;
            &lt;script&gt;
                alert('Hi!');
            &lt;/script&gt;
        &lt;/foo&gt;
        &lt;br&gt;
    </pre>

    <p>The middle</p>

    <pre class="other-class">
        &lt;bar&gt;
            &lt;foo&gt;Text&lt;/foo&gt;
            &lt;script&gt;
                alert('Bye!');
            &lt;/script&gt;
        &lt;/bar&gt;
        &lt;br&gt;
    </pre>
    <p>The end</p>
</body>

Upvotes: 0

Views: 115

Answers (2)

Webucator

Reputation: 2683

Thanks to @Jens and @Joop, I used a solution that combines the JSoup parser and RegEx.

  1. Find all <pre> elements that contain child elements:

    Document doc = Jsoup.parse(html);
    Elements badPres = doc.select("pre:has(*)");

  2. Loop through those elements, applying @Joop's RegEx solution to each.

Upvotes: 0

Joop Eggen

Reputation: 109547

To find the shortest match in a regex, use the reluctant quantifier .*?. To let the . match newline characters as well, enable DOTALL mode with the inline flag (?s).

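import java.util.regex.Matcher;
import java.util.regex.Pattern;

// (?s) lets . match newlines; (?i) ignores case; .*? stops at the first </pre>.
Pattern prePattern = Pattern.compile("(?si)(<pre[^>]*>)(.*?)</pre>");
StringBuffer sb = new StringBuffer(html.length() + 1000);
Matcher m = prePattern.matcher(html);
while (m.find()) {
    // Escape the angle brackets inside the <pre> body only.
    String text = m.group(2);
    text = text.replace("<", "&lt;").replace(">", "&gt;");
    // quoteReplacement keeps any $ or \ in the content from being read as group references.
    m.appendReplacement(sb, Matcher.quoteReplacement(m.group(1) + text + "</pre>"));
}
m.appendTail(sb);
html = sb.toString();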
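
If you only need to flag the offending files first, as the question asks, the same pattern can be used for detection instead of replacement. The following is a minimal sketch, not a definitive implementation: the class name PreTagChecker and the method hasUnescapedBrackets are made up for illustration, and it assumes <pre> elements are not nested and that no attribute value in the opening tag contains a > character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PreTagChecker {
    // Case-insensitive, DOTALL, reluctant body, as in the pattern above.
    // \b keeps the pattern from matching other tag names that start with "pre".
    private static final Pattern PRE_BLOCK =
            Pattern.compile("(?si)<pre\\b[^>]*>(.*?)</pre>");

    /** Returns true if any <pre> body contains a raw '<' or '>'. */
    public static boolean hasUnescapedBrackets(String html) {
        Matcher m = PRE_BLOCK.matcher(html);
        while (m.find()) {
            String body = m.group(1);
            if (body.indexOf('<') >= 0 || body.indexOf('>') >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String bad = "<pre class=\"a\">\n  <foo>text</foo>\n</pre>";
        String good = "<pre class=\"a\">\n  &lt;foo&gt;text&lt;/foo&gt;\n</pre>";
        System.out.println(hasUnescapedBrackets(bad));
        System.out.println(hasUnescapedBrackets(good));
    }
}
```

Read each file into a String and call hasUnescapedBrackets on it; tags outside of <pre> blocks, such as <h1> or <p>, are ignored because only the captured body is inspected.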

Upvotes: 1
