Reputation: 3222

Regex - I only want to match the start tags in regex

I am writing a regex that should match only malformed tags, i.e. a <p> start tag followed by *some text here, some other tags may be here as well but no ending 'p' tag*. For example:

 <P>Affectionately Inscribed </P><P>TO </P><P>HENRY BULLAR, </P><P>(of the western circuit)<P>PREFACE</P>

In the sample text above I want to get (of the western circuit) as the result, and nothing else should be captured. I'm using this, but it's not working:

<P>[^\(</P>\)]*<P>

Please help.

Upvotes: 2

Answers (5)

Alan Moore

Reputation: 75222

All of the solutions offered so far match the second <p>, but that's wrong. What if there are two consecutive <p> elements without closing tags? The second one won't be matched because the first match ate its opening tag. You can avoid that problem by using a lookahead as I did here:

@"<p\b(?>(?:[^<]+|<(?!/?p>))*)(?=<p\b|$)"

As for the rest of it, I used a "not the initial or not the rest" technique along with an atomic group to guide the regex to a match as efficiently as possible (and, more importantly, to fail as quickly as possible if it's going to).
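To illustrate the point about consecutive unclosed elements, here is a minimal sketch in Python (chosen only for illustration; the pattern above appears to be written as a C# verbatim string). The test string and variable names are made up. The lookahead pattern is a simplified variant, not the exact regex above: it drops the atomic group, because Python's `re` only supports atomic groups from version 3.11 on, and it consumes one character at a time to keep backtracking cheap without it.

```python
import re

# Hypothetical input: two consecutive <p> elements without closing tags.
text = "<p>one<p>two<p>three</p>"

# Consuming variant: the trailing <p> is part of the match, so it cannot
# serve as the opening tag of the next match.
consuming = re.compile(r"<p>(?:(?!</p>).)*?<p>", re.IGNORECASE)

# Lookahead variant (simplified, no atomic group): the next <p> is only
# asserted by (?=...), not consumed.
lookahead = re.compile(r"<p\b(?:[^<]|<(?!/?p\b))*(?=<p\b|$)", re.IGNORECASE)

consuming_matches = [m.group(0) for m in consuming.finditer(text)]
lookahead_matches = [m.group(0) for m in lookahead.finditer(text)]

print(consuming_matches)  # ['<p>one<p>']
print(lookahead_matches)  # ['<p>one', '<p>two']
```

The consuming pattern finds only the first unclosed element, while the lookahead pattern finds both, because each match stops just before the next opening tag.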

Upvotes: 0

Richard

Reputation: 108995

Rather than using * for maximal match, use *? for minimal.

Should be able to make a start with

<P>((?!</P>).)*?<P>

This uses a negative lookahead assertion to ensure the end tag is not matched at any point between the two "<P>" matches.
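As a rough check, the pattern can be run against the sample text from the question. This is a sketch in Python rather than the asker's environment, and the second pattern (with the extra capturing group) is my own variation for extracting just the inner text, not part of the answer.

```python
import re

sample = (
    "<P>Affectionately Inscribed </P><P>TO </P><P>HENRY BULLAR, </P>"
    "<P>(of the western circuit)<P>PREFACE</P>"
)

# The pattern from above: a character is consumed only where </P> does not start.
pattern = re.compile(r"<P>((?!</P>).)*?<P>")
matches = [m.group(0) for m in pattern.finditer(sample)]
print(matches)  # ['<P>(of the western circuit)<P>']

# Variation (an assumption, not from the answer): wrap the whole repetition
# in one capturing group so group 1 holds the text between the two tags.
inner_pattern = re.compile(r"<P>((?:(?!</P>).)*?)<P>")
inner = inner_pattern.findall(sample)
print(inner)  # ['(of the western circuit)']
```

Note that the full match includes both <P> tags, so the second tag is consumed.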

EDIT: Corrected to put assertion (thanks to commenter).

Upvotes: 0

David Dean

Reputation: 7701

I know this isn't likely (or even html-legal?) to happen in this case, but a generic unclosed xml-tag solution would be pretty difficult as you need to consider what would happen with nested tags like

<p>OUTER BEFORE<p>INNER</p>OUTER AFTER</p>

I'm pretty sure the regular expressions given so far would match the second <p> there, even though it is not actually an unclosed <p>.
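A quick way to check this claim is to run one of the proposed patterns on the nested example. The sketch below uses Python and the negative-lookahead pattern from another answer, made case-insensitive and with a non-capturing group; the variable names are hypothetical.

```python
import re

nested = "<p>OUTER BEFORE<p>INNER</p>OUTER AFTER</p>"

# Tempered lazy pattern: <p>, then anything that does not cross </p>, then <p>.
pattern = re.compile(r"<p>(?:(?!</p>).)*?<p>", re.IGNORECASE)

m = pattern.search(nested)
false_positive = m.group(0) if m else None
print(false_positive)  # <p>OUTER BEFORE<p>
```

The pattern reports a match even though both <p> tags here have closing tags, which is the false positive described above.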

Upvotes: 1

Tomalak

Reputation: 338178

Match group one of:

(?:<p>(?:(?!<\/?p>).?)+)(<p>)

matches the second <p> in:

<P>(of the western circuit)<P>PREFACE</P>
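For illustration, this is roughly how group one could be read out in Python. The case-insensitive flag is an assumption added here because the sample uses uppercase tags while the pattern uses lowercase ones.

```python
import re

text = "<P>(of the western circuit)<P>PREFACE</P>"

# The pattern from above; re.IGNORECASE is an added assumption.
pattern = re.compile(r"(?:<p>(?:(?!<\/?p>).?)+)(<p>)", re.IGNORECASE)
m = pattern.search(text)

second_tag = m.group(1)        # the captured second start tag
rest = text[m.start(1):]       # everything from that tag onwards

print(second_tag)  # <P>
print(rest)        # <P>PREFACE</P>
```

`m.start(1)` gives the position of the second start tag, which is where a missing </P> would need to be inserted.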

Note: I'm usually one of those that say: "Don't do HTML with regex, use a parser instead". But I don't think the specific problem can be solved with a parser, which would probably just ignore/transparently deal with the invalid markup.

Upvotes: 1

Marc Gravell

Reputation: 1062755

Regex is not always a good choice for xml/html type data. In particular, attributes, case-sensitivity, comments, etc all have a big impact.

For xhtml, I'd use XmlDocument/XDocument and an xpath query.

For "non-x" html, I'd look at the HTML Agility Pack and the same.
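As a rough illustration of the parser-based route, here is a sketch that uses Python's standard-library `html.parser` instead of XmlDocument or the HTML Agility Pack (a swapped-in tool, not what this answer names). It relies on the assumption that a <p> cannot contain another <p> in HTML, so a new <p> start tag while one is still open means the earlier one was never closed. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class UnclosedPFinder(HTMLParser):
    """Collects the text of <p> elements that are followed by another <p>
    before any </p> is seen. A <p> still open at end of input is ignored."""

    def __init__(self):
        super().__init__()
        self.open_text = None  # text chunks of the currently open <p>, or None
        self.unclosed = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":  # tag names arrive lowercased
            if self.open_text is not None:
                self.unclosed.append("".join(self.open_text))
            self.open_text = []

    def handle_endtag(self, tag):
        if tag == "p":
            self.open_text = None

    def handle_data(self, data):
        if self.open_text is not None:
            self.open_text.append(data)


def find_unclosed_p(html):
    finder = UnclosedPFinder()
    finder.feed(html)
    finder.close()
    return finder.unclosed


sample = (
    "<P>Affectionately Inscribed </P><P>TO </P><P>HENRY BULLAR, </P>"
    "<P>(of the western circuit)<P>PREFACE</P>"
)
print(find_unclosed_p(sample))  # ['(of the western circuit)']
```

Because the tokenizer handles attributes, case and comments itself, those concerns do not have to be encoded in a regex.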

Upvotes: 7
