sairn

Reputation: 490

What regex will work with Java's "replaceAll" function to remove the <p> html tag and its contents from an html string?

What regular expression will work with the Java replaceAll() method to remove every <p> HTML tag, together with the contents between its opening and closing tags, from an HTML string?

For example, after applying the method,

"<div><p>table <b>test</b> title</p><table><tbody><tr><td>this is table cell value</td></tr></tbody></table><p>miscellaneous contents</p><span>blah</span></div>"

becomes:

"<div><table><tbody><tr><td>this is table cell value</td></tr></tbody></table><span>blah</span></div>"

Note: This is an "academic" exercise. I am not seeking a solution that uses an XML/HTML parser.


UPDATE:

Getting closer to a solution on this (thanks, jlordo!). Your pattern seems to work, up to a point.

However, the suggested regex string ("<[pP]>.*?</[pP]>") has no effect on a <p> tag that contains an attribute (in this case, a "style" attribute). See below:

    public static void main(String[] args)
    {
        String htmlstring = "<div><p style='text-align: center; font-style: italic'>[click the <b>submit</b> button to create the new company.]</p><table><tbody><tr><td>this is table cell value</td></tr></tbody></table><p>miscellaneous contents</p><span>blah</span></div>";
        htmlstring = htmlstring.replaceAll("<[pP]>.*?</[pP]>", "");
    }

htmlstring (before scrubbing):

<div><p style='text-align: center; font-style: italic'>[click the <b>submit</b> button to create the new company.]</p><table><tbody><tr><td>this is table cell value</td></tr></tbody></table><p>miscellaneous contents</p><span>blah</span></div>

htmlstring (after scrubbing):

<div><p style='text-align: center; font-style: italic'>[click the <b>submit</b> button to create the new company.]</p><table><tbody><tr><td>this is table cell value</td></tr></tbody></table><span>blah</span></div>

Is there anything we can do to "tweak" it so that it handles this issue?

Upvotes: 0

Views: 3279

Answers (2)

Mike Samuel

Reputation: 120506

    Pattern pTag = Pattern.compile(
      // A start p tag.
      "<p(?![a-z0-9:\\-])([^>\"']|\"[^\"]*\"|'[^']*')*>"
      + ".*?"   // Phrasing content; does not handle comment, RCDATA or raw text boundaries.
      // An end p tag.
      + "</p(?![a-z0-9:\\-])[^>]*>",
      Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    htmlstring = pTag.matcher(htmlstring).replaceAll("");

The Pattern.DOTALL flag causes .*? to match newlines. Without it, your original regex would not match any paragraph whose body contains a newline.
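To see the effect of the flag in isolation, here is a minimal sketch. The class name DotallDemo and the helper stripParagraphs are made up for this illustration, and the pattern is deliberately the simple "<p>.*?</p>" rather than the full one.

```java
import java.util.regex.Pattern;

public class DotallDemo {
    // Hypothetical helper: removes <p>...</p> with a simple pattern,
    // compiled with whatever flags the caller passes in.
    static String stripParagraphs(String html, int flags) {
        return Pattern.compile("<p>.*?</p>", flags).matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<div><p>line one\nline two</p><span>blah</span></div>";

        // Without DOTALL, '.' stops at the newline, so nothing is removed.
        System.out.println(stripParagraphs(html, 0));

        // With DOTALL, '.' also matches the newline and the paragraph is removed.
        System.out.println(stripParagraphs(html, Pattern.DOTALL));
    }
}
```

The first call prints the input unchanged (newline included); the second prints <div><span>blah</span></div>.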

The Pattern.CASE_INSENSITIVE flag is specified without Pattern.UNICODE_CASE because Unicode case-folding is unnecessary here, and I'm not confident that Turkish case-folding wouldn't create a subtle maintenance hazard if this regex were later modified to deal with <i>.

The ([^>"']|"[^"]*"|'[^']*') part matches any tag body character or quoted attribute value. It will misbehave on certain non-validating attribute names like <p ain't-this=confusing>. The attribute grammar is regular, but a full treatment of quote characters in attribute values vs. names would hugely expand the size of this regex. It would also be unlikely to help: anything requiring a full treatment has to deal with the fact that backticks can quote attributes on a few browsers, which means that no single regular expression can find value boundaries for arbitrarily messy HTML.

The (?![a-z0-9:\\-]) makes sure the name of the tag is "p" and not "plaintext" or "p-" or "p:foo" or some other HTML identifier of which "p" is a prefix.
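A quick way to check the lookahead on its own is to run just the tag-name part against a few start tags. This is only a sketch; the class name TagNameDemo and the helper startsParagraph are invented for the example.

```java
import java.util.regex.Pattern;

public class TagNameDemo {
    // Only the tag-name check: "<p" not followed by another name character.
    static final Pattern P_TAG_START =
        Pattern.compile("<p(?![a-z0-9:\\-])", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: true if the text begins with a <p ...> start tag.
    static boolean startsParagraph(String tag) {
        return P_TAG_START.matcher(tag).lookingAt();
    }

    public static void main(String[] args) {
        String[] tags = {"<p>", "<P style='x'>", "<pre>", "<plaintext>", "<p:foo>", "<p->"};
        for (String tag : tags) {
            System.out.println(tag + " -> " + startsParagraph(tag));
        }
    }
}
```

Only the first two tags report true; <pre>, <plaintext>, <p:foo> and <p-> are rejected because the character after "p" is still part of the tag name.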

This may misbehave on some constructs like:

  • <p><!-- </p> -->Not an orphaned end tag</p>
  • <p><textarea>Not a paragraph</p></textarea></p>
  • <noscript><p>Not a paragraph contextually</p></noscript>
  • <p ain't-this=confusing>Foo</p> <p>Isn't recognized as separate</p>
  • <p><script>alert("Not a real </p> tag");</script></p>
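The sketch below runs a pattern of this shape against one input it handles well and against the comment case from the list. The class name EdgeCaseDemo and the helper strip are hypothetical, and the single-quoted alternative is written with its closing quote.

```java
import java.util.regex.Pattern;

public class EdgeCaseDemo {
    // A p-element pattern of the shape discussed here. Assumption: the
    // single-quoted alternative is written with its closing quote.
    static final Pattern P_ELEMENT = Pattern.compile(
        "<p(?![a-z0-9:\\-])([^>\"']|\"[^\"]*\"|'[^']*')*>"
        + ".*?"
        + "</p(?![a-z0-9:\\-])[^>]*>",
        Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: removes every match of the pattern.
    static String strip(String html) {
        return P_ELEMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // A '>' inside a quoted attribute value does not end the start tag.
        System.out.println(strip("<div><p class=\"a>b\">gone</p><span>kept</span></div>"));

        // The </p> inside the comment is taken as the end tag, so the rest
        // of the comment and the real end tag are left behind.
        System.out.println(strip("<p><!-- </p> -->Not an orphaned end tag</p>"));
    }
}
```

The first line printed is <div><span>kept</span></div>. The second is " -->Not an orphaned end tag</p>" (with a leading space), which is the kind of damage the list warns about.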

Upvotes: 1

Evgeniy Dorofeev

Reputation: 136002

Try:

    htmlstring = htmlstring.replaceAll("(?i)<p.*?>.*?</p>", "");

Note that (?i) turns on the case-insensitive flag.
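As a rough check, the sketch below applies this replaceAll call to the HTML string from the question and to one extra input. The class name SimpleStripDemo, the helper strip, and the <pre> input are made up for the illustration.

```java
public class SimpleStripDemo {
    // Hypothetical helper wrapping the replaceAll call from this answer.
    static String strip(String html) {
        return html.replaceAll("(?i)<p.*?>.*?</p>", "");
    }

    public static void main(String[] args) {
        // The HTML string from the question, with the style attribute.
        String question =
            "<div><p style='text-align: center; font-style: italic'>"
            + "[click the <b>submit</b> button to create the new company.]</p>"
            + "<table><tbody><tr><td>this is table cell value</td></tr></tbody></table>"
            + "<p>miscellaneous contents</p><span>blah</span></div>";
        System.out.println(strip(question));

        // "<p.*?>" also matches the start of "<pre>", so the match runs from
        // <pre> to the first </p> and the <pre> element is removed as well.
        System.out.println(strip("<div><pre>code</pre><p>text</p></div>"));
    }
}
```

The first call yields the output the question asks for. The second prints <div></div>, so this pattern is best kept to input where no other tag name starts with "p". Since it does not enable DOTALL, a paragraph containing a newline would also be left in place.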

Upvotes: 1
