Reputation: 1256

Select the first paragraph tag not contained within another tag using RegEx (Perl-style)

I have this block of html:

<div>
  <p>First, nested paragraph</p>
</div>
<p>First, non-nested paragraph.</p>
<p>Second paragraph.</p>
<p>Last paragraph.</p>

I'm trying to select the first non-nested paragraph in that block. I'm using PHP's (Perl-style) preg_match to find it, but I can't figure out how to ignore the p tag contained within the div.

This is what I have so far, but it selects the contents of the nested paragraph inside the div instead.

/<p>(.+?)<\/p>/is

Thanks!

EDIT

Unfortunately, I don't have the luxury of a DOM Parser.

I completely appreciate the suggestions not to use RegEx to parse HTML, but that's not really helping my particular use case. I have a very controlled case where an internal application generates structured text. I'm trying to replace some text if it matches a certain pattern. This is a simplified case where I'm trying to ignore text nested within other text, and HTML was the simplest case I could think of to explain it. My actual case looks more like this (but with a lot more data, and minified):

#[BILLINGCODE|12345|11|15|2001|15|26|50]#
[ITEM1|{{Escaped Description}}|1|1|4031|NONE|15]
#[{{Additional Details }}]#
[ITEM2|{{Escaped Description}}|3|1|7331|NONE|15]
[ITEM3|{{Escaped Description}}|1|1|9431|NONE|15]
[ITEM4|{{Escaped Description}}|1|1|5131|NONE|15]

I have to reformat a certain column of certain rows across a ton of rows similar to those. Help with my first question would help with the actual project.

Upvotes: 0

Answers (5)

kgoedtel

Reputation: 31

How about something like this?

<p>([^<>]+)<\/p>(?=(<[^\/]|$))

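It uses a look-ahead to make sure the paragraph is not immediately followed by a closing tag (which would mean it is nested), though it may be at the end of the string. Note that this assumes there is no whitespace between tags, as in minified input. There is probably a better way to match what is inside the paragraph tags, but you need to avoid being too greedy (a .+? will not suffice).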
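A minimal sketch of how this look-ahead idea could be called from PHP. The `\s*` inside the look-ahead is my own addition, assumed necessary because the sample in the question has newlines between tags; the variable names are illustrative only.

```php
<?php
// Sample block from the question.
$html = <<<'HTML'
<div>
  <p>First, nested paragraph</p>
</div>
<p>First, non-nested paragraph.</p>
<p>Second paragraph.</p>
<p>Last paragraph.</p>
HTML;

// Accept a paragraph only if the next thing after it (ignoring whitespace)
// is an opening tag or the end of the string. A closing tag such as </div>
// right after the paragraph means the paragraph is nested.
$pattern = '~<p>([^<>]+)</p>(?=\s*(?:<[^/]|$))~';

$first = preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
echo $first, PHP_EOL;
```

This only looks at what follows the paragraph, so a nested paragraph that is followed by a sibling opening tag inside the same container would still be matched. It is probably only suitable for tightly controlled input like the case described in the question.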

Upvotes: 2

Charles Sprayberry

Reputation: 7853

"You shouldn't use regex to parse HTML."

It is what everybody says, but nobody really offers an example of how to actually do it; they just preach it. Well, thanks to some motivation from Levi Morrison, I decided to read into DOMDocument and figure out how to do it.

To everybody that says "Oh, it is too hard to learn the parser, I'll just use regex": I've never done anything with DOMDocument or XPath before, and this took me 10 minutes. Go read the docs on DOMDocument and parse HTML the way you're supposed to.

$myHtml = <<<MARKUP
   <html>
       <head>
            <title>something</title></head>
       <body>
            <div>
                <p>not valid</p>
            </div>
            <p>is valid</p>
            <p>is not valid</p>
            <p>is not valid either</p>
            <div>
                <p>definitely not valid</p>
            </div>
       </body>
   </html>
MARKUP;

$DomDocument = new DOMDocument();
$DomDocument->loadHTML($myHtml);
$DomXPath = new DOMXPath($DomDocument);
$nodeList = $DomXPath->query('body/p');
$yourNode = $DomDocument->saveHtml($nodeList->item(0));

var_dump($yourNode);

// output '<p>is valid</p>'

Upvotes: 1

mob

Reputation: 118635

Use a ~~two~~ three step process. First, pray that everything is well formed. Second, ~~First,~~ remove everything that is nested.

s{<div>.*?</div>}{}gs;        # HTML example (/s lets . match the newlines inside the div)
s/#.*?#//g;                   # 2nd example

Then get your result. Everything that is left is now not nested.

($result) = m{<p>(.*?)</p>}s;   # HTML example (list context returns the capture, not true/false)
($result) = m{\[(.*?)\]};       # 2nd example

(this is Perl. Don't know how different it would look in PHP).
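For reference, a rough PHP sketch of the same strip-then-match idea. The helper name `firstOutside` is hypothetical, and the patterns assume well-formed input in which `#` only ever appears as a delimiter.

```php
<?php
// Hypothetical helper: remove the "nested" regions, then return the first
// capture from what is left (or null if nothing matches).
function firstOutside(string $strip, string $want, string $text): ?string
{
    $flat = preg_replace($strip, '', $text);
    if ($flat === null) {
        return null; // the regex engine reported an error
    }
    return preg_match($want, $flat, $m) === 1 ? $m[1] : null;
}

$html = <<<'HTML'
<div>
  <p>First, nested paragraph</p>
</div>
<p>First, non-nested paragraph.</p>
<p>Second paragraph.</p>
HTML;

$data = <<<'TXT'
#[BILLINGCODE|12345|11|15|2001|15|26|50]#
[ITEM1|{{Escaped Description}}|1|1|4031|NONE|15]
#[{{Additional Details }}]#
[ITEM2|{{Escaped Description}}|3|1|7331|NONE|15]
TXT;

// The "s" modifier lets "." cross the newlines inside the <div>.
$p    = firstOutside('~<div>.*?</div>~s', '~<p>(.*?)</p>~s', $html);
$item = firstOutside('~#.*?#~', '~\[(.*?)\]~', $data);

echo $p, PHP_EOL;
echo $item, PHP_EOL;
```

`preg_replace` stands in for Perl's `s///g`, and `preg_match` with a capture array stands in for the list-context match.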

Upvotes: 1

Mr. Llama

Reputation: 20899

You might want to have a look at this post about parsing HTML with Regex.

Because HTML is not a regular language (and regular expressions can only describe regular languages), you can't parse out arbitrary chunks of HTML using regex. Use an HTML parser; it'll get the job done considerably more smoothly than trying to hack together some regex.

Upvotes: 0

fge

Reputation: 121780

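Your regex won't work. It stops at the first </p> it finds, and that one belongs to the nested paragraph. And with a greedy .+ instead of .+?, even if you had only non-nested paragraphs, your capturing parentheses would match First, non-nested ... Last paragraph..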

Try:

<([^>]+)>([^<]*(?:<(?!/?\1)[^<]*)*)</\1>

and grab \2 if \1 is p.
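A sketch of how the "grab \2 if \1 is p" step might look in PHP. The pattern here is a variant of the one above, written so that the text before the first inner tag is consumed, the repeated group is non-capturing, and the closing tag is spelled `</\1>`. Those adjustments are my assumptions about the intent.

```php
<?php
// Sample block from the question.
$html = <<<'HTML'
<div>
  <p>First, nested paragraph</p>
</div>
<p>First, non-nested paragraph.</p>
<p>Second paragraph.</p>
<p>Last paragraph.</p>
HTML;

// Group 1: tag name. Group 2: everything up to the matching closing tag,
// consumed so that no opening or closing tag of the same name is crossed.
$pattern = '~<([^>]+)>([^<]*(?:<(?!/?\1)[^<]*)*)</\1>~';

$first = null;
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        if ($m[1] === 'p') {    // "grab \2 if \1 is p"
            $first = $m[2];
            break;
        }
    }
}
echo $first, PHP_EOL;
```

Each match swallows a whole top-level element, so the nested paragraph is consumed as part of the div and never reported on its own. The pattern does not handle tags with attributes or an element nested inside another element of the same name.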

But an HTML parser would do a better job of that imho.

Upvotes: 2
