user989818


regular expression to extract html tags

I have XML inside a content placeholder that I need to extract, like:

<asp:Content ID="Content2" ContentPlaceHolderID="header" runat="server">
    <div>
        <categories>
            <category>
                <name>item 1</name>
                <categories>
                    <category>
                        <name>item 1.1.</name>
                    </category>
                    <category>
                        <name>item 1.2.</name>
                    </category>
                </categories>
            </category>
        </categories>
    </div>
</asp:Content>

And so on. I'll build the proper HTML using LINQ to XML over the root categories, but I'm failing to extract all of the XML with a regular expression. Is there a better way to extract the XML?

Upvotes: 1

Views: 471

Answers (2)

zx81

Reputation: 41838

The following regex matches your xml. It also captures everything inside the asp:content tags and places it in Group 1.

(?s)<asp:Content ID="[^"]*"\W+ContentPlaceHolderID="[^"]*"\W+runat="[^"]*">(.*?)</asp:Content>

Note that (?s) is the inline modifier that turns on "dot matches new line" mode in several regex flavors, such as .NET, Java, Perl, Python, and PCRE (which backs PHP's preg functions). Without it, the dot stops at line breaks and (.*?) cannot span a multi-line block.

If you are using a different regex flavor, you will need to remove (?s) and activate "dot matches new line" differently.
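As a rough illustration of the two ways to switch the mode on, here is a minimal Python sketch. The shortened `subject` string is made up for the demo; the pattern is the one from this answer. It compares no flag at all, the inline `(?s)` modifier, and Python's `re.DOTALL` flag.

```python
import re

# The pattern from the answer, without the inline (?s) modifier.
PATTERN = (
    r'<asp:Content ID="[^"]*"\W+ContentPlaceHolderID="[^"]*"'
    r'\W+runat="[^"]*">(.*?)</asp:Content>'
)

# Shortened, made-up subject that spans several lines.
subject = (
    '<asp:Content ID="Content2" ContentPlaceHolderID="header" runat="server">\n'
    '<div>\n'
    '</div>\n'
    '</asp:Content>'
)

# Without "dot matches new line", (.*?) cannot cross the line breaks.
no_flag = re.findall(PATTERN, subject)

# Inline modifier, as in the answer.
inline = re.findall("(?s)" + PATTERN, subject)

# Same mode, activated through a flag argument instead.
with_flag = re.findall(PATTERN, subject, re.DOTALL)

print(no_flag)
print(inline == with_flag)
```

In flavors without `(?s)`, a common workaround is to replace the dot with a class such as `[\s\S]`.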

The following code retrieves the group captures. To show a general solution, the subject string contains two of these placeholders.

<?php
$subject='
<asp:Content ID="blah" ContentPlaceHolderID="blah" runat="blah">Capture Me!</asp:Content>
<asp:Content ID="Content2" ContentPlaceHolderID="header" runat="server">
<div>
<categories>
<category>
     <name>item 1</name>
            <categories>
                <category>
                    <name>item 1.1.</name>
                </category>
                <category>
                    <name>item 1.2.</name>
                </category>
            </categories>
        </category>
    </categories>
</div>
</asp:Content>
';

// Group 1 of every match ends up in $result[1]; each entry is [text, offset].
preg_match_all('%(?s)<asp:Content ID="[^"]*"\W+ContentPlaceHolderID="[^"]*"\W+runat="[^"]*">(.*?)</asp:Content>%', $subject, $result, PREG_OFFSET_CAPTURE | PREG_PATTERN_ORDER);
// Loop over the matches, not the groups: count($result) is the number of groups (2).
for ($i = 0; $i < count($result[1]); $i++) {
    echo "Capture number: ".$i."<br />".htmlentities($result[1][$i][0])."<br /><br />";
    // echo "Match number: ".$i."<br />".htmlentities($result[0][$i][0])."<br /><br />";
}
?>

Here is the output:

Capture number: 0
Capture Me!

Capture number: 1
<div> <categories> <category> <name>item 1</name> <categories> <category> <name>item 1.1.</name> </category> <category> <name>item 1.2.</name> </category> </categories> </category> </categories> </div>

If you also want to display the whole match (not just the capture), just uncomment the second echo line in the for loop.

I think this is what you were looking for?

Upvotes: 0

FailedDev

Reputation: 26930

See "Reading XML documents using LINQ to XML" and "XML Made Easy with LINQ to XML".

Does it matter if the XML is surrounded by other markup? Just give the root to LINQ and work your way through it. Simple, robust and easy to maintain. In general, don't even think about doing what you are about to do.
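To make the parser-based route concrete, here is a minimal sketch in Python, with `xml.etree.ElementTree` standing in for LINQ to XML. It assumes the fragment is well-formed XML apart from the undeclared `asp:` prefix, so it wraps the text in a dummy `<root>` element that declares that prefix. The namespace URI, the wrapper element, and the `walk` helper are all hypothetical names chosen for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, only there so the parser accepts the asp: prefix.
ASP_NS = "urn:example:asp"

fragment = """<asp:Content ID="Content2" ContentPlaceHolderID="header" runat="server">
    <div>
        <categories>
            <category>
                <name>item 1</name>
                <categories>
                    <category>
                        <name>item 1.1.</name>
                    </category>
                    <category>
                        <name>item 1.2.</name>
                    </category>
                </categories>
            </category>
        </categories>
    </div>
</asp:Content>"""


def walk(categories, depth=0):
    """Return (depth, name) pairs for a <categories> element, recursively."""
    rows = []
    for category in categories.findall("category"):
        rows.append((depth, category.findtext("name")))
        nested = category.find("categories")
        if nested is not None:
            rows.extend(walk(nested, depth + 1))
    return rows


# Wrap the fragment in a dummy root that declares the asp prefix, then parse.
wrapped = f'<root xmlns:asp="{ASP_NS}">{fragment}</root>'
root = ET.fromstring(wrapped)

# Locate the content placeholder and start from its root <categories> element.
content = root.find(f"{{{ASP_NS}}}Content")
tree = walk(content.find("div/categories"))

for depth, name in tree:
    print("  " * depth + name)
```

Because the parser tracks nesting itself, nested `<categories>` elements are handled without any pattern matching. A real .aspx page may also contain directives that are not XML, and those would need to be stripped first.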

Upvotes: 1
