Reputation: 2683
I have HTML content with <pre> tags that contain other tags. All the angle brackets in the <pre> content should be escaped using HTML entities. In other words, every < should become &lt; and every > should become &gt;.
For starters, I just want to figure out which files have the offending content. Can anyone think of a way to do this using a regular expression?
BAD: RegEx should match this.
<body>
<h1>My Content</h1>
<pre class="some-class">
<foo>
<bar>Content</bar>
<script>
alert('Hi!');
</script>
</foo>
<br>
</pre>
<p>The middle</p>
<pre class="other-class">
<bar>
<foo>Text</foo>
<script>
alert('Bye!');
</script>
</bar>
<br>
</pre>
<p>The end</p>
</body>
GOOD: RegEx should not match this.
<body>
<h1>My Content</h1>
<pre class="some-class">
&lt;foo&gt;
&lt;bar&gt;Content&lt;/bar&gt;
&lt;script&gt;
alert('Hi!');
&lt;/script&gt;
&lt;/foo&gt;
&lt;br&gt;
</pre>
<p>The middle</p>
<pre class="other-class">
&lt;bar&gt;
&lt;foo&gt;Text&lt;/foo&gt;
&lt;script&gt;
alert('Bye!');
&lt;/script&gt;
&lt;/bar&gt;
&lt;br&gt;
</pre>
<p>The end</p>
</body>
Upvotes: 0
Views: 115
Reputation: 2683
Thanks to @Jens and @Joop, I used a solution that combines the JSoup parser and RegEx.
Find all <pre> elements that contain child elements:
Document doc = Jsoup.parse(html);
Elements badPres = doc.select("pre:has(*)");
Loop through those applying @Joop's RegEx solution.
Upvotes: 0
Reputation: 109547
To find the shortest match in a regex, use the reluctant quantifier .*? instead of .*
Also, to let the . match newline characters, one needs DOTALL mode, which the embedded flag (?s) turns on.
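// Requires java.util.regex.Matcher and java.util.regex.Pattern.
// (?s) lets . match newlines, (?i) ignores case; .*? stops at the first </pre>.
Pattern prePattern = Pattern.compile("(?si)(<pre[^>]*>)(.*?)</pre>");
StringBuffer sb = new StringBuffer(html.length() + 1000);
Matcher m = prePattern.matcher(html);
while (m.find()) {
    String text = m.group(2);
    // Escape the angle brackets inside the <pre> body.
    text = text.replace("<", "&lt;").replace(">", "&gt;");
    // quoteReplacement keeps any $ or \ in the content from being read as a group reference.
    m.appendReplacement(sb, Matcher.quoteReplacement(m.group(1) + text + "</pre>"));
}
m.appendTail(sb);
html = sb.toString();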
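Since the question first asks only which files contain offending content, the same pattern shape can be used as a yes/no check before rewriting anything. The following is a minimal sketch, not a definitive implementation: the class name PreTagCheck and the method hasUnescapedPre are hypothetical names chosen for illustration, and it assumes <pre> elements are not nested and that the file content has already been read into a String.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reports whether any <pre> body still holds raw angle brackets.
class PreTagCheck {
    // Group 1 is the body of one <pre> element; (?s) = DOTALL, (?i) = case-insensitive.
    private static final Pattern PRE_BODY =
            Pattern.compile("(?si)<pre[^>]*>(.*?)</pre>");

    /** Returns true if at least one <pre> body contains an unescaped '<' or '>'. */
    public static boolean hasUnescapedPre(String html) {
        Matcher m = PRE_BODY.matcher(html);
        while (m.find()) {
            String body = m.group(1);
            if (body.indexOf('<') >= 0 || body.indexOf('>') >= 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String bad = "<pre class=\"a\">\n<foo>Text</foo>\n</pre>";
        String good = "<pre class=\"a\">\n&lt;foo&gt;Text&lt;/foo&gt;\n</pre>";
        System.out.println(hasUnescapedPre(bad));  // true
        System.out.println(hasUnescapedPre(good)); // false
    }
}
```

Calling hasUnescapedPre on each file's content gives the list of files that need fixing; the replacement loop above can then be run on just those.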
Upvotes: 1