Xetius

Reputation: 46844

Regex to match all HTML tags except <p> and </p>

I need to match and remove all tags except <p> and </p> using a regular expression in Perl. I have the following:

<\/?(?!p).+?>

But this still matches the closing </p> tag. Any hint on how to exclude the closing tag as well?

Note, this is being performed on xhtml.

Upvotes: 24

Views: 43831

Answers (14)

Adebowale

Reputation: 39

This works for me because all the solutions above failed for other HTML tags starting with p, such as param, pre, progress, etc. It also takes care of HTML attributes.

~(<\/?[^>]*(?<!<\/p|p)>)~ig

Upvotes: 0

zx81

Reputation: 41838

Xetius, resurrecting this ancient question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)

With all the disclaimers about using regex to parse html, here is a simple way to do it.

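#!/usr/bin/perl
use strict;
use warnings;

# Group 1 captures <p ...> and </p>; the second alternative matches any other tag.
my $regex   = qr{(</?p[^>]*>)|<[^>]*>};
my $subject = 'Bad html <a> </I> <p>My paragraph</p> <i>Italics</i> <p class="blue">second</p>';
# Put back whatever group 1 captured; other tags captured nothing and are dropped.
(my $replaced = $subject) =~ s/$regex/defined $1 ? $1 : ''/eg;
print $replaced . "\n";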

See this live demo

Reference

How to match pattern except in situations s1, s2, s3

How to match a pattern unless...

Upvotes: 3

y_nk

Reputation: 2275

I used Xetius's regex and it works fine, except for some Flex-generated tags, which can be :
with no spaces inside. I tried to fix it with a simple ? after \s and it looks like it's working:

<(?!\/?p(?=>|\s?.*>))\/?.*?>

I'm using it to clear tags from Flex-generated HTML text, so I also added more excepted tags:

<(?!\/?(p|a|b|i|u|br)(?=>|\s?.*>))\/?.*?>

Upvotes: 5

moritz

Reputation: 12852

The original regex can be made to work with very little effort:

 <(?>/?)(?!p).+?>

The problem was that the /? (or \?) gave up what it matched when the assertion after it failed. Using a non-backtracking group (?>...) around it takes care that it never releases the matched slash, so the (?!p) assertion is always anchored to the start of the tag text.

(That said I agree that generally parsing HTML with regexes is not the way to go).
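
A minimal sketch of the difference, assuming the two patterns are exactly the question's original and the atomic-group version above. The variable names are illustrative only.

```perl
use strict;
use warnings;

# The question's original pattern: /? may backtrack and give up the slash.
my $backtracking = qr{</?(?!p).+?>};
# Atomic-group version: once /? has taken the slash, it cannot give it back.
my $atomic = qr{<(?>/?)(?!p).+?>};

# Report, for each tag, whether each pattern matches it.
for my $tag ('</p>', '</b>', '<p>') {
    printf "%-5s backtracking: %-3s atomic: %s\n",
        $tag,
        ($tag =~ $backtracking ? 'yes' : 'no'),
        ($tag =~ $atomic       ? 'yes' : 'no');
}
```

Only the backtracking pattern matches </p>; both match </b>, and neither matches <p>. Note that (?!p) only looks at the first letter of the tag name, so either pattern also skips tags such as <pre>.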

Upvotes: 1

Jörg W Mittag

Reputation: 369526

In my opinion, trying to parse HTML with anything other than an HTML parser is just asking for a world of pain. HTML is a really complex language (which is one of the major reasons that XHTML was created, which is much simpler than HTML).

For example, this:

<HTML /
  <HEAD /
    <TITLE / > /
    <P / >

is a complete, 100% well-formed, 100% valid HTML document. (Well, it's missing the DOCTYPE declaration, but other than that ...)

It is semantically equivalent to

<html>
  <head>
    <title>
      &gt;
    </title>
  </head>
  <body>
    <p>
      &gt;
    </p>
  </body>
</html>

But it's nevertheless valid HTML that you're going to have to deal with. You could, of course, devise a regex to parse it, but, as others already suggested, using an actual HTML parser is just sooo much easier.

Upvotes: 16

John Siracusa

Reputation: 15271

If you insist on using a regex, something like this will work in most cases:

# Remove all HTML except "p" tags
$html =~ s{<(?>/?)(?:[^pP]|[pP][^\s>/])[^>]*>}{}g;

Explanation:

s{
  <             # opening angled bracket
  (?>/?)        # ratchet past optional / 
  (?:
    [^pP]       # non-p tag
    |           # ...or...
    [pP][^\s>/] # longer tag that begins with p (e.g., <pre>)
  )
  [^>]*         # everything until closing angled bracket
  >             # closing angled bracket
 }{}gx; # replace with nothing, globally
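
As a rough check of the pattern explained above, here is a small sketch that runs it over a mixed sample. The helper name remove_non_p_tags and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution from the answer.
sub remove_non_p_tags {
    my ($html) = @_;
    $html =~ s{<(?>/?)(?:[^pP]|[pP][^\s>/])[^>]*>}{}g;
    return $html;
}

# Mixed-case p tags, a p tag with an attribute, and tags that start with "p".
my $sample = '<div><P class="x">one</P><pre>two</pre><br /><p>three</p></div>';
print remove_non_p_tags($sample), "\n";
```

With this sample, the div, pre and br tags are removed while every p tag (upper or lower case, with or without attributes) is left in place.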

But really, save yourself some headaches and use a parser instead. CPAN has several modules that are suitable. Here's an example using the HTML::TokeParser module that comes with the extremely capable HTML::Parser CPAN distribution:

use strict;

use HTML::TokeParser;

my $parser = HTML::TokeParser->new('/some/file.html')
  or die "Could not open /some/file.html - $!";

while(my $t = $parser->get_token)
{
  # Skip start or end tags that are not "p" tags
  next  if(($t->[0] eq 'S' || $t->[0] eq 'E') && lc $t->[1] ne 'p');

  # Print everything else normally (see HTML::TokeParser docs for explanation)
  if($t->[0] eq 'T')
  {
    print $t->[1];
  }
  else
  {
    print $t->[-1];
  }
}

HTML::Parser accepts input in the form of a file name, an open file handle, or a string. Wrapping the above code in a library and making the destination configurable (i.e., not just printing as in the above) is not hard. The result will be much more reliable, maintainable, and possibly also faster (HTML::Parser uses a C-based backend) than trying to use regular expressions.
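
A possible shape for such a wrapper, as a sketch only: it assumes HTML::TokeParser is installed, passes the document as a reference to a string, and collects the output in a string instead of printing. The sub name strip_non_p_tags is hypothetical.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical wrapper: returns the input with every tag except <p>/</p> removed.
sub strip_non_p_tags {
    my ($html) = @_;

    # A reference to a string tells HTML::TokeParser to parse that string.
    my $parser = HTML::TokeParser->new(\$html)
        or die "Could not create parser";

    my $out = '';
    while (my $t = $parser->get_token) {
        my $type = $t->[0];

        # Skip start or end tags that are not "p" tags
        next if ($type eq 'S' || $type eq 'E') && lc $t->[1] ne 'p';

        # Text tokens keep their text in slot 1; other tokens keep the
        # original source text in the last slot.
        $out .= $type eq 'T' ? $t->[1] : $t->[-1];
    }
    return $out;
}

print strip_non_p_tags('<div><p class="x">Hi <b>there</b></p></div>'), "\n";
```

Returning a string keeps the destination up to the caller, who can print it, write it to a file, or process it further.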

Upvotes: 38

Kibbee

Reputation: 66132

You also might want to allow for whitespace before the "p" in the p tag. Not sure how often you'll run into this, but < p> is perfectly valid HTML.

Upvotes: 1

dbr

Reputation: 169623

Not sure why you are wanting to do this - regex for HTML sanitisation isn't always the best method (you need to remember to sanitise attributes and such, remove javascript: hrefs and the likes)... but, a regex to match HTML tags that aren't <p></p>:

(<[^pP].*?>|</[^pP]>)

Verbose:

(
    <               # < opening tag
        [^pP].*?    # a non-p character, then non-greedy anything
    >               # > closing tag
|                   #   ....or....
    </              # </
        [^pP]       # a non-p tag
    >               # >
)

Upvotes: 4

Xetius

Reputation: 46844

I came up with this:

<(?!\/?p(?=>|\s.*>))\/?.*?>

x/
<               # Match open angle bracket
(?!             # Negative lookahead (Not matching and not consuming)
    \/?         # 0 or 1 /
    p           # p
    (?=         # Positive lookahead (Matching and not consuming)
        >       # > - No attributes
        |       # or
        \s      # whitespace
        .*      # anything up to
        >       # close angle brackets - with attributes
    )           # close positive lookahead
)               # close negative lookahead
                # if we have got this far then we don't match
                # a p tag or closing p tag
                # with or without attributes
\/?             # optional close tag symbol (/)
.*?             # and anything up to
>               # first closing tag
/

This will now deal with p tags with or without attributes and the closing p tags, but will match pre and similar tags, with or without attributes.

It doesn't strip out attributes, but my source data does not put them in. I may change this later to do this, but this will suffice for now.
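
To see how the pattern above behaves, here is a short sketch that applies it to a single-line sample. The helper name strip_non_p and the sample markup are assumptions made for the example.

```perl
use strict;
use warnings;

# The pattern from this answer, written with {} delimiters so / needs no escaping.
my $non_p_tag = qr{<(?!/?p(?=>|\s.*>))/?.*?>};

# Hypothetical helper: delete every tag the pattern matches.
sub strip_non_p {
    my ($html) = @_;
    (my $out = $html) =~ s/$non_p_tag//g;
    return $out;
}

print strip_non_p('<div><p class="intro">Hello <b>world</b></p><pre>code</pre></div>'), "\n";
```

The p tags survive with their attributes, while div, b and pre are removed. Because . does not match a newline by default, a tag whose attributes span several lines would need the /s modifier or a different character class.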

Upvotes: 14

Vegard Larsen

Reputation: 13047

You should probably also remove any attributes on the <p> tag, since someone bad could do something like:

<p onclick="document.location.href='http://www.evil.com'">Clickable text</p>

The easiest way to do this is to use the regex people suggest here to search for <p> tags with attributes, and replace them with <p> tags without attributes. Just to be on the safe side.
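
One way this replacement might look, as a sketch rather than a complete sanitiser. The helper name strip_p_attributes is hypothetical, and the pattern assumes no attribute value contains a literal > character.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite every opening <p ...> tag as a bare <p>.
# The \s after "p" keeps tags such as <pre ...> from being touched.
sub strip_p_attributes {
    my ($html) = @_;
    $html =~ s{<p\s[^>]*>}{<p>}gi;
    return $html;
}

print strip_p_attributes(
    q{<p onclick="document.location.href='http://www.evil.com'">Clickable text</p>}
), "\n";
```

Because of the /i modifier, an upper-case <P ...> tag is also rewritten, and it comes out as a lower-case <p>.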

Upvotes: -1

Konrad Rudolph

Reputation: 545865

Try this, it should work:

/<\/?([^p](\s.+?)?|..+?)>/

Explanation: it matches either a single letter except “p”, followed by an optional whitespace and more characters, or multiple letters (at least two).

/EDIT: I've added the ability to handle attributes in p tags.

Upvotes: 0

Konrad Rudolph

Reputation: 545865

Since HTML is not a regular language

HTML isn't, but HTML tags are, and they can be adequately described by regular expressions.

Upvotes: 2

DrPizza

Reputation: 18360

Since HTML is not a regular language I would not expect a regular expression to do a very good job at matching it. They might be up to this task (though I'm not convinced), but I would consider looking elsewhere; I'm sure Perl must have some off-the-shelf libraries for manipulating HTML.

Anyway, I would think that what you want to match is </?(p.+|.*)(\s*.*)> non-greedily (I don't know the vagaries of Perl's regexp syntax so I cannot help further). I am assuming that \s means whitespace. Perhaps it doesn't. Either way, you want something that'll match attributes offset from the tag name by whitespace. But it's more difficult than that, as people often put unescaped angle brackets inside scripts and comments and perhaps even quoted attribute values, which you don't want to match against.

So as I say, I don't really think regexps are the right tool for the job.

Upvotes: 2

Brian Warshaw

Reputation: 22974

Assuming that this will work in Perl as it does in languages that claim to use Perl-compatible syntax:

/<\/?[^p][^>]*>/

EDIT:

But that won't match a <pre> or <param> tag, unfortunately.

This, perhaps?

/<\/?(?!p>|p )[^>]+>/

That should cover <p> tags that have attributes, too.

Upvotes: 1
