Reputation: 13
I'm doing a project at uni where I have to clean some HTML code using regex (I know, not the best approach...).
Input of body:
<h1>This is heading 1</h1>
<h2 style="color: aqua">This is heading 2</h2>
<h3>This is heading 3</h3>
<p>This is a paragraph.</p>
<p>This is another paragraph.</p>
<a href="https://www.w3schools.com">This is a link</a>
<ul>
<li>Coffee</li>
<li>Tea</li>
<li>Milk</li>
</ul>
I get a list of allowed tags and have to remove every other tag along with its content. For example, {h3, p, ul}.
First I strip all attributes (they're not allowed), then I came up with this regex, which removes the given tags and their content:
String regex = "(?i)<([h3|ul|p]+)>\\n?.*\\n?<\\/\\1>";
It works, but now I have to negate it and remove all tags and content except those given in...
I tried this, but it doesn't work:
`...[?!h3|ul|p]...`
Desired outcome for this example:
<h3>This is heading 3</h3>
<p>This is a paragraph.</p>
<p>This is another paragraph.</p>
<ul>
</ul>
I don't really understand negative lookahead or how to apply it to my problem, so I'd be thankful for any advice.
Upvotes: 1
Views: 857
Reputation: 18357
The negative lookahead you are trying to use needs to be written as `(?!(?:h3|ul|p)\b)`, which will not match an h3, ul, or p tag. Notice the word boundary `\b` at the end: it makes the lookahead reject only exact matches of those tag names, so a tag such as pre is not mistaken for p. Besides removing the tags, you also have to remove the whitespace left behind after removing them, so overall the regex you need is this:
\h*<(?!(?:h3|ul|p)\b)([^>]+).*?>[\w\W]*?</\1>\s*
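To see what the word boundary in the lookahead changes, here is a minimal sketch that tests only the start of a tag. The class and method names are made up for illustration, and the pattern is a cut-down piece of the regex above, not the full removal expression.

```java
import java.util.List;
import java.util.regex.Pattern;

class LookaheadDemo {
    // "<" followed by a tag name that is NOT exactly h3, ul or p.
    static final Pattern WITH_BOUNDARY = Pattern.compile("<(?!(?:h3|ul|p)\\b)\\w+");
    // Same idea without \b: any name that merely starts with h3, ul or p is spared.
    static final Pattern NO_BOUNDARY = Pattern.compile("<(?!h3|ul|p)\\w+");

    // True when the string starts with a tag that the removal regex would target.
    static boolean startsDisallowedTag(String s) {
        return WITH_BOUNDARY.matcher(s).lookingAt();
    }

    static boolean startsDisallowedTagNoBoundary(String s) {
        return NO_BOUNDARY.matcher(s).lookingAt();
    }

    public static void main(String[] args) {
        for (String tag : List.of("<h3>", "<p>", "<ul>", "<h1>", "<pre>")) {
            System.out.println(tag
                    + " with \\b: " + (startsDisallowedTag(tag) ? "removed" : "kept")
                    + ", without \\b: " + (startsDisallowedTagNoBoundary(tag) ? "removed" : "kept"));
        }
    }
}
```

With `\b`, `<pre>` counts as a tag to remove; without it, `<pre>` slips through because its name starts with p.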
Regex Explanation:
`\h*` - Matches zero or more horizontal whitespace characters (spaces, tabs, and similar) before the tag
`<` - Start of tag
`(?!(?:h3|ul|p)\b)` - Negative lookahead that rejects exactly the h3, ul and p tags
`([^>]+)` - Matches a tag name of one or more characters and captures it in group 1 for back-referencing later. You can use something like `\w+` or a character set with allowed characters to match only what you want.
`.*?>` - Matches zero or more characters (basically attributes) followed by the closing `>` of the start tag
`[\w\W]*?` - Matches any characters, including newlines, zero or more times in a non-greedy way
`</\1>` - Matches the closing tag, where `\1` represents what was matched earlier as the tag name
`\s*` - Matches zero or more whitespace characters, which consumes the empty space left by removing the tags
Java Code demo,
String s = "<h1>This is heading 1</h1>\r\n" +
"<h2 style=\"color: aqua\">This is heading 2</h2>\r\n" +
"<h3>This is heading 3</h3>\r\n" +
"<p>This is a paragraph.</p>\r\n" +
"<p>This is another paragraph.</p>\r\n" +
"<a href=\"https://www.w3schools.com\">This is a link</a>\r\n" +
"<ul>\r\n" +
" <li>Coffee</li>\r\n" +
" <li>Tea</li>\r\n" +
" <li>Milk</li>\r\n" +
"</ul>";
System.out.println("Before:\n" + s);
System.out.println("\nAfter:\n" + s.replaceAll("\\h*<(?!(?:h3|ul|p)\\b)([^>]+).*?>[\\w\\W]*?</\\1>\\s*", ""));
Output,
Before:
<h1>This is heading 1</h1>
<h2 style="color: aqua">This is heading 2</h2>
<h3>This is heading 3</h3>
<p>This is a paragraph.</p>
<p>This is another paragraph.</p>
<a href="https://www.w3schools.com">This is a link</a>
<ul>
<li>Coffee</li>
<li>Tea</li>
<li>Milk</li>
</ul>
After:
<h3>This is heading 3</h3>
<p>This is a paragraph.</p>
<p>This is another paragraph.</p>
<ul>
</ul>
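Since the question mentions receiving a list of allowed tags, the same regex could be built from that list instead of being hard-coded. The following is only a sketch under a few assumptions: `TagFilter`, `removalPattern` and `keepOnly` are hypothetical names, attributes may or may not have been stripped already, and the tag name is captured with `(\w+)\b` followed by `[^>]*` rather than `([^>]+).*?`, as the explanation above suggests is possible.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class TagFilter {
    // Builds the removal regex from the allowed tag names (hypothetical helper).
    static Pattern removalPattern(List<String> allowedTags) {
        String alternation = allowedTags.stream()
                .map(Pattern::quote)               // treat each name as a literal
                .collect(Collectors.joining("|"));
        // With no allowed tags, skip the lookahead so that every tag is removed.
        String lookahead = allowedTags.isEmpty() ? "" : "(?!(?:" + alternation + ")\\b)";
        return Pattern.compile(
                "\\h*<" + lookahead + "(\\w+)\\b[^>]*>[\\w\\W]*?</\\1>\\s*");
    }

    // Removes every element whose tag name is not in allowedTags, content included.
    static String keepOnly(String html, List<String> allowedTags) {
        return removalPattern(allowedTags).matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<h1>Title</h1>\n<h3>Kept</h3>\n<ul>\n  <li>Tea</li>\n</ul>";
        System.out.println(keepOnly(html, List.of("h3", "p", "ul")));
    }
}
```

Like the original regex, this cannot cope with a disallowed tag nested inside another tag of the same name (for example a div inside a div), because the lazy match stops at the first closing tag it finds.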
Upvotes: 1
Reputation: 27743
Alternatively, you might extract the tags that you want to keep in your output. This expression may be a better choice for that, and it can be modified if you wish:
(<(p|h3.*)>.*<\/(.*)>)|(<(ul.*)>[\s\S]*<\/(ul)>)
It has two groups, one for p and h3 and the other for ul, which you can wrap in another capturing group:
((<(p|h3.*)>.*<\/(.*)>)|(<(ul.*)>[\s\S]*<\/(ul)>))
If this isn't the expression you wanted, you can modify it on regex101.com.
You can also visualize your expressions on jex.im.
Java test:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
final String regex = "((<(p|h3.*)>.*<\\/(.*)>)|(<(ul.*)>[\\s\\S]*<\\/(ul)>))";
final String string = "<h1>This is heading 1</h1>\n"
+ "<h2 style=\"color: aqua\">This is heading 2</h2>\n"
+ "<h3>This is heading 3</h3>\n"
+ "<p>This is a paragraph.</p>\n"
+ "<p>This is another paragraph.</p>\n"
+ "<a href=\"https://www.w3schools.com\">This is a link</a>\n"
+ "<ul>\n"
+ " <li>Coffee</li>\n"
+ " <li>Tea</li>\n"
+ " <li>Milk</li>\n"
+ "</ul>";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
    System.out.println("Full match: " + matcher.group(0));
    for (int i = 1; i <= matcher.groupCount(); i++) {
        System.out.println("Group " + i + ": " + matcher.group(i));
    }
}
JavaScript demo:
const regex = /((<(p|h3.*)>.*<\/(.*)>)|(<(ul.*)>[\s\S]*<\/(ul)>))/gm;
const str = `<h1>This is heading 1</h1>
<h2 style="color: aqua">This is heading 2</h2>
<h3>This is heading 3</h3>
<p>This is a paragraph.</p>
<p>This is another paragraph.</p>
<a href="https://www.w3schools.com">This is a link</a>
<ul>
<li>Coffee</li>
<li>Tea</li>
<li>Milk</li>
</ul>`;
let m;
while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
This expression only captures the tags you want to keep. It does not follow a negation strategy.
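To turn the captured matches into a cleaned string, one option is to collect the full matches and join them. This is a rough sketch that reuses the expression above unchanged; `AllowedTagExtractor` and `extractAllowed` are names invented for the example, and joining with a newline is an assumption about the desired layout. Note that everything inside a matched ul is kept, including its li items, which differs from the desired outcome shown in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AllowedTagExtractor {
    // The extraction expression from this answer, without the outer wrapping group.
    static final Pattern ALLOWED = Pattern.compile(
            "(<(p|h3.*)>.*<\\/(.*)>)|(<(ul.*)>[\\s\\S]*<\\/(ul)>)");

    // Collects every full match and joins the matches with newlines.
    static String extractAllowed(String html) {
        List<String> kept = new ArrayList<>();
        Matcher matcher = ALLOWED.matcher(html);
        while (matcher.find()) {
            kept.add(matcher.group());
        }
        return String.join("\n", kept);
    }

    public static void main(String[] args) {
        String html = "<h1>A</h1>\n<h3>B</h3>\n<p>C</p>\n<ul>\n  <li>D</li>\n</ul>";
        System.out.println(extractAllowed(html));
    }
}
```

Because `[\s\S]*` is greedy, two separate ul blocks in the input would be captured as one match together with everything between them, so this approach fits the sample input better than arbitrary HTML.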
Upvotes: 1