shabby

Reputation: 3222

Regex - I only want to match the start tags in regex

I am writing a regular expression that should match only malformed tags, i.e. a <p> followed by some text (other tags may be in there as well) but with no closing </p> tag. For example:

 <P>Affectionately Inscribed </P><P>TO </P><P>HENRY BULLAR, </P><P>(of the western circuit)<P>PREFACE</P>

In the sample text above, I want the result to be <P>(of the western circuit)<P>, and nothing else should be captured. I'm using this, but it's not working:

<P>[^\(</P>\)]*<P>

Please help.

Upvotes: 2

Views: 1550

Answers (5)

Alan Moore

Reputation: 75222

All of the solutions offered so far match the second <P>, but that's wrong. What if there are two consecutive <P> elements without closing tags? The second one won't be matched because the first match ate its opening tag. You can avoid that problem by using a lookahead as I did here:

@"<p\b(?>(?:[^<]+|<(?!/?p>))*)(?=<p\b|$)"

As for the rest of it, I used a "not the initial or not the rest" technique along with an atomic group to guide the regex to a match as efficiently as possible (and, more importantly, to fail as quickly as possible if it's going to).
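To make the "ate its opening tag" problem concrete, here is a minimal sketch in Python rather than the .NET syntax above. Two things are assumptions of mine: the input string is made up for illustration, and the atomic group with its inner `[^<]+` is replaced by a plain one-character-per-step alternation, because Python's `re` only accepts `(?>...)` from version 3.11 on and a bare `[^<]+` inside `*` would invite heavy backtracking. That should only affect how quickly a failed attempt gives up, not which text is matched.

```python
import re

# Lookahead version: the next <p> is only peeked at, never consumed.
# Adapted from the pattern above (atomic group removed, see the note before this block).
LOOKAHEAD = re.compile(r"<p\b(?:[^<]|<(?!/?p>))*(?=<p\b|$)", re.IGNORECASE)

# A pattern that consumes the following <P> as part of the match.
CONSUMING = re.compile(r"<P>((?!</P>).)*?<P>", re.IGNORECASE)

# Hypothetical input: two unclosed <P> elements in a row, then a closed one.
text = "<P>one<P>two<P>three</P>"

lookahead_hits = [m.group(0) for m in LOOKAHEAD.finditer(text)]
consuming_hits = [m.group(0) for m in CONSUMING.finditer(text)]

print(lookahead_hits)   # both unclosed elements are reported
print(consuming_hits)   # only the first one is reported
```

The consuming pattern swallows the opening tag of the second element, so the search resumes after it and never sees that the second element is unclosed too.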

Upvotes: 0

Richard

Reputation: 108995

Rather than using * for maximal match, use *? for minimal.

You should be able to make a start with:

<P>((?!</P>).)*?<P>

This uses a negative lookahead assertion to ensure the end tag is not matched at each point between the "<P>" matches.

EDIT: Corrected the placement of the assertion (thanks to the commenter).
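A quick way to check this against the sample text from the question, sketched in Python (the choice of language and the variable names are mine, not part of the question):

```python
import re

text = ("<P>Affectionately Inscribed </P><P>TO </P><P>HENRY BULLAR, </P>"
        "<P>(of the western circuit)<P>PREFACE</P>")

# (?!</P>). consumes one character, but only where a closing tag does not start.
pattern = re.compile(r"<P>((?!</P>).)*?<P>")

# group(0) is the whole match; the capture group only holds the last character.
matches = [m.group(0) for m in pattern.finditer(text)]
print(matches)
```

Every properly closed element fails at its `</P>`, because the lookahead refuses to step over it, so only the unclosed element should come back.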

Upvotes: 0

David Dean

Reputation: 7701

I know this isn't likely to happen in this case (and may not even be legal HTML), but a generic solution for unclosed XML tags would be pretty difficult, because you need to consider what happens with nested tags like:

<p>OUTER BEFORE<p>INNER</p>OUTER AFTER</p>

I'm pretty sure the regular expressions given so far would match the second <p> there, even though it is not actually an unclosed <p>.
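A small Python sketch to check that suspicion, using one of the patterns from this thread with case-insensitive matching turned on (the tag counting at the end is just my own sanity check, not a proposed solution):

```python
import re

nested = "<p>OUTER BEFORE<p>INNER</p>OUTER AFTER</p>"

# One of the patterns from this thread, matched case-insensitively.
pattern = re.compile(r"<p>((?!</p>).)*?<p>", re.IGNORECASE)

hit = pattern.search(nested)
print(hit.group(0) if hit else None)

# Counting tags shows the markup is balanced, so any hit is a false positive.
opens = len(re.findall(r"<p>", nested, re.IGNORECASE))
closes = len(re.findall(r"</p>", nested, re.IGNORECASE))
print(opens, closes)
```

The pattern has no notion of nesting depth, so it reports the outer element as unclosed as soon as it sees a second opening tag before a closing one.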

Upvotes: 1

Tomalak

Reputation: 338178

Group one of this pattern (matched case-insensitively):

(?:<p>(?:(?!<\/?p>).?)+)(<p>)

matches the second <p> in:

<P>(of the western circuit)<P>PREFACE</P>

Note: I'm usually one of those that say: "Don't do HTML with regex, use a parser instead". But I don't think the specific problem can be solved with a parser, which would probably just ignore/transparently deal with the invalid markup.
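For reference, a minimal Python sketch of reading group one from the pattern above. The language and the `re.IGNORECASE` flag are my assumptions: the pattern is written with a lowercase `<p>` while the sample uses `<P>`, so some form of case-insensitive matching is needed.

```python
import re

text = "<P>(of the western circuit)<P>PREFACE</P>"

# The negative lookahead stops the inner loop at any <p> or </p>,
# so group 1 can only be a <p> that follows an unclosed one.
pattern = re.compile(r"(?:<p>(?:(?!<\/?p>).?)+)(<p>)", re.IGNORECASE)

m = pattern.search(text)
print(m.group(1), m.start(1))   # the captured tag and where it begins
```

`m.start(1)` gives the offset of the offending second tag, which is handy if the goal is to insert the missing `</P>` just before it.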

Upvotes: 1

Marc Gravell

Reputation: 1062755

Regex is not always a good choice for XML/HTML-type data. In particular, attributes, case sensitivity, comments, etc. all have a big impact.

For XHTML, I'd use XmlDocument/XDocument and an XPath query.

For "non-X" HTML, I'd look at the HTML Agility Pack and do the same.
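As a rough illustration of the parser route, here is a sketch that uses Python's standard `html.parser` as a stand-in for the .NET libraries named above. It is an event-based tokenizer, not an XPath query, and the class name `UnclosedPFinder` is hypothetical. It only reports a `<p>` that is still open when the next `<p>` starts, and it treats any new `<p>` as implicitly ending the previous one.

```python
from html.parser import HTMLParser


class UnclosedPFinder(HTMLParser):
    """Collects the text of every <p> that is still open when the next <p> starts."""

    def __init__(self):
        super().__init__()
        self.in_p = False     # True while a <p> has been opened but not closed
        self.buffer = []      # text seen since the current <p> opened
        self.unclosed = []    # text of each <p> still open when another <p> began

    def handle_starttag(self, tag, attrs):
        if tag != "p":        # the parser hands over tag names in lowercase
            return
        if self.in_p:         # the previous <p> never saw its </p>
            self.unclosed.append("".join(self.buffer))
        self.in_p = True
        self.buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False
            self.buffer = []

    def handle_data(self, data):
        if self.in_p:
            self.buffer.append(data)


text = ("<P>Affectionately Inscribed </P><P>TO </P><P>HENRY BULLAR, </P>"
        "<P>(of the western circuit)<P>PREFACE</P>")

finder = UnclosedPFinder()
finder.feed(text)
finder.close()
print(finder.unclosed)
```

Attributes, case differences and comments are handled by the tokenizer here, which is the main thing a hand-written regex tends to get wrong.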

Upvotes: 7
