Reputation: 490
What regular expression can be used with Java's replaceAll()
method to remove each <p>
HTML element, including everything between its start and end tags, from an HTML string?
For example, after applying the method,
"<div><p>table <b>test</b> title</p><table><tbody><tr><td>this is table cell value</td></tr></tbody></table><p>miscellaneous contents</p><span>blah</span></div>"
becomes:
"<div><table><tbody><tr><td>this is table cell value</td></tr></tbody></table><span>blah</span></div>"
Note: This is an "academic" exercise. I am not seeking a solution that uses an XML/HTML parser.
Getting closer to a solution on this (thanks, jlordo!)... Your pattern seems to work somewhat...
However, the suggested regex string ("<[pP]>.*?</[pP]>"
) does not appear to have any effect on a <p>
tag that carries an attribute (in this case a "style" attribute); see below:
public static void main(String[] args)
{
    String htmlstring = "<div><p style='text-align: center; font-style: italic'>[click the <b>submit</b> button to create the new company.]</p><table><tbody><tr><td>this is table cell value</td></tr></tbody></table><p>miscellaneous contents</p><span>blah</span></div>";
    htmlstring = htmlstring.replaceAll("<[pP]>.*?</[pP]>", "");
}
htmlstring (before scrubbing):
<div><p style='text-align: center; font-style: italic'>[click the <b>submit</b> button to create the new company.]</p><table><tbody><tr><td>this is table cell value</td></tr></tbody></table><p>miscellaneous contents</p><span>blah</span></div>
htmlstring (after scrubbing):
<div><p style='text-align: center; font-style: italic'>[click the <b>submit</b> button to create the new company.]</p><table><tbody><tr><td>this is table cell value</td></tr></tbody></table><span>blah</span></div>
Is there anything we can do to "tweak" it so that it handles this issue?
Upvotes: 0
Views: 3279
Reputation: 120506
Pattern.compile(
    // A start p tag.
    "<p(?![a-z0-9:\\-])([^>\"']|\"[^\"]*\"|'[^']*')*>"
    + ".*?"  // Phrasing content; does not handle comment, RCDATA or raw text boundaries.
    // An end p tag.
    + "</p(?![a-z0-9:\\-])[^>]*>",
    Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
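As a rough sketch of how this pattern might be applied to the string from the question: the class and method names below (ParagraphStripper, stripParagraphs) are made up for illustration, and the single-quoted alternative is assumed to be closed as '[^']*'.

```java
import java.util.regex.Pattern;

class ParagraphStripper {
    // Start tag, lazily matched body, end tag; compiled once and reused.
    private static final Pattern P_ELEMENT = Pattern.compile(
        "<p(?![a-z0-9:\\-])([^>\"']|\"[^\"]*\"|'[^']*')*>"
        + ".*?"
        + "</p(?![a-z0-9:\\-])[^>]*>",
        Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: removes every match of the pattern from the input.
    static String stripParagraphs(String html) {
        return P_ELEMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<div><p style='text-align: center; font-style: italic'>"
            + "[click the <b>submit</b> button to create the new company.]</p>"
            + "<table><tbody><tr><td>this is table cell value</td></tr></tbody></table>"
            + "<p>miscellaneous contents</p><span>blah</span></div>";
        System.out.println(stripParagraphs(html));
    }
}
```

On the question's input this should leave only the div, table and span markup, with both paragraphs removed.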
The Pattern.DOTALL
flag will cause .*?
to match newlines, which is necessary because your original regex would not match any paragraph whose body contained a newline.
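A small sketch of the difference, using the regex from the question with and without the flag. The class and method names (DotAllDemo, stripWithoutDotAll, stripWithDotAll) are hypothetical.

```java
import java.util.regex.Pattern;

class DotAllDemo {
    // The regex from the question, unchanged.
    static final String REGEX = "<[pP]>.*?</[pP]>";

    // Without DOTALL, '.' stops at line terminators.
    static String stripWithoutDotAll(String html) {
        return Pattern.compile(REGEX).matcher(html).replaceAll("");
    }

    // With DOTALL, '.' also matches line terminators.
    static String stripWithDotAll(String html) {
        return Pattern.compile(REGEX, Pattern.DOTALL).matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<div><p>line one\nline two</p><span>blah</span></div>";
        System.out.println(stripWithoutDotAll(html)); // paragraph is still there
        System.out.println(stripWithDotAll(html));    // <div><span>blah</span></div>
    }
}
```

The same effect can be had inline by prefixing the regex with (?s) when calling String.replaceAll() directly.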
The Pattern.CASE_INSENSITIVE
flag is specified without Pattern.UNICODE_CASE
because Unicode-aware case folding is unnecessary here, and I'm not confident that Turkish case-folding wouldn't create a subtle maintenance hazard were this regex modified to deal with <i>
.
The ([^>"']|"[^"]*"|'[^']*')
part matches any tag body character or quoted attribute value. It will misbehave on certain non-validating attribute names like <p ain't-this=confusing>
. The attribute grammar is regular, but a full treatment of quote characters in attribute values versus names would hugely expand the size of this regex. It would also be unlikely to help: anything requiring a full treatment has to deal with the fact that backticks can quote attributes on a few browsers, which means that no single regular expression can find value boundaries for arbitrarily messy HTML.
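To illustrate why the quoted alternatives matter, here is a sketch comparing a naive start-tag pattern with the quote-aware one on a tag whose attribute value contains ">". The names (StartTagDemo, firstMatch, NAIVE, QUOTE_AWARE) are hypothetical, and the single-quoted alternative is assumed to be closed as '[^']*'.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StartTagDemo {
    // Naive: the start tag ends at the first '>' no matter where it appears.
    static final Pattern NAIVE = Pattern.compile("<p[^>]*>");

    // Quote-aware: a '>' inside a quoted attribute value does not end the tag.
    static final Pattern QUOTE_AWARE = Pattern.compile(
        "<p(?![a-z0-9:\\-])([^>\"']|\"[^\"]*\"|'[^']*')*>",
        Pattern.CASE_INSENSITIVE);

    // Returns the first match of the pattern in the input, or null if none.
    static String firstMatch(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<p title=\"a > b\">text</p>";
        System.out.println(firstMatch(NAIVE, html));       // <p title="a >
        System.out.println(firstMatch(QUOTE_AWARE, html)); // <p title="a > b">
    }
}
```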
The (?![a-z0-9:\\-])
makes sure the name of the tag is "p" and not "plaintext" or "p-" or "p:foo" or some other HTML identifier of which "p" is a prefix.
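A quick sketch of the lookahead on its own, checking which start tags it accepts. The class and method names (TagNameDemo, startsPTag) are made up for the example.

```java
import java.util.regex.Pattern;

class TagNameDemo {
    // "<p" that is not followed by another tag-name character.
    static final Pattern P_NAME =
        Pattern.compile("<p(?![a-z0-9:\\-])", Pattern.CASE_INSENSITIVE);

    // True if the input begins with a start tag whose name is exactly "p".
    static boolean startsPTag(String tag) {
        return P_NAME.matcher(tag).lookingAt();
    }

    public static void main(String[] args) {
        System.out.println(startsPTag("<p>"));           // true
        System.out.println(startsPTag("<P class='x'>")); // true
        System.out.println(startsPTag("<pre>"));         // false
        System.out.println(startsPTag("<plaintext>"));   // false
        System.out.println(startsPTag("<p:foo>"));       // false
        System.out.println(startsPTag("<p-custom>"));    // false
    }
}
```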
This may misbehave on some constructs like:
<p><!-- </p> -->Not an orphaned end tag</p>
<p><textarea>Not a paragraph</p></textarea></p>
<noscript><p>Not a paragraph contextually</p></noscript>
<p ain't-this=confusing>Foo</p> <p>Isn't recognized as separate</p>.
<p><script>alert("Not a real </p> tag");</script></p>
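To make the first of these concrete, a sketch that runs the pattern over the comment example. The names (MisbehaviorDemo, strip) are hypothetical, and the single-quoted alternative is assumed to be closed as '[^']*'.

```java
import java.util.regex.Pattern;

class MisbehaviorDemo {
    static final Pattern P_ELEMENT = Pattern.compile(
        "<p(?![a-z0-9:\\-])([^>\"']|\"[^\"]*\"|'[^']*')*>"
        + ".*?"
        + "</p(?![a-z0-9:\\-])[^>]*>",
        Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    static String strip(String html) {
        return P_ELEMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // The lazy body stops at the "</p>" inside the comment, so only
        // "<p><!-- </p>" is removed and the rest is left behind.
        String html = "<p><!-- </p> -->Not an orphaned end tag</p>";
        System.out.println("[" + strip(html) + "]");
        // prints: [ -->Not an orphaned end tag</p>]
    }
}
```

The regex has no notion of comment or script boundaries, so the first textual "</p>" always wins.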
Upvotes: 1
Reputation: 136002
Try:
htmlstring = htmlstring.replaceAll("(?i)<p.*?>.*?</p>", "");
Note that (?i) turns on the case-insensitive flag.
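A sketch of this regex on the question's input, plus one caveat: because "<p.*?>" accepts any tag name starting with "p", it can also swallow elements such as <pre>. The names (InlineFlagDemo, strip, stripTighter) and the tightened variant using \b and [^>]* are assumptions for illustration, not part of the suggestion above.

```java
class InlineFlagDemo {
    // The regex as suggested: (?i) makes <P> and </P> match as well.
    static String strip(String html) {
        return html.replaceAll("(?i)<p.*?>.*?</p>", "");
    }

    // Hypothetical variant: \b rejects names such as "pre",
    // and [^>]* keeps the start tag from running past its own '>'.
    static String stripTighter(String html) {
        return html.replaceAll("(?i)<p\\b[^>]*>.*?</p>", "");
    }

    public static void main(String[] args) {
        String html = "<div><p style='text-align: center; font-style: italic'>"
            + "[click the <b>submit</b> button to create the new company.]</p>"
            + "<table><tbody><tr><td>this is table cell value</td></tr></tbody></table>"
            + "<p>miscellaneous contents</p><span>blah</span></div>";
        System.out.println(strip(html)); // both paragraphs removed

        String withPre = "<pre>code</pre><p>x</p>";
        System.out.println("[" + strip(withPre) + "]");        // []
        System.out.println("[" + stripTighter(withPre) + "]"); // [<pre>code</pre>]
    }
}
```

Neither form matches a paragraph that spans several lines; (?is) would be needed for that.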
Upvotes: 1