Ahmed iqbal

Reputation: 597

How to extract URLs and text from HTML markup with regex

<!-- This div is repeated in the HTML with different property values -->

<div style="position:absolute; overflow:hidden; left:220px; top:785px; width:347px; height:18px; z-index:36">

<!-- On some pages, this is the only unique thing -->
<a href="http://link.domain.com/?id=123" target="_parent">

<!-- OR, on other pages, the only unique thing is this URL ending with the .mp3 extension -->
<a href="http://domain.com/song-title.mp3" target="_parent">

    <!-- This div is also repeated multiple times in the HTML -->

    <FONT style="font-size:10pt" color=#000000 face="Tahoma">
        <DIV><B>Harjaiyaan</B> - Nandini Srikar</DIV>
    </FONT>
</a>

</DIV>

We have very dirty HTML markup, generated by some program or application. We want to extract both the URLs and the text from it.

The href attributes contain two types of URL. URL pattern 1: 'http://link.domain.com/?id=123'. URL pattern 2: 'http://domain.com/song-title.mp3'.

The first type follows a specific pattern, but the second has no fixed pattern: those URLs simply end with the '.mp3' extension.

Is there a function to extract the URLs matching these patterns, along with the link text?

Note: without DOM, is there any way to match the href and the text between the tags with a regular expression, e.g. preg_match?

Upvotes: 1

Views: 322

Answers (2)

Use the DOMDocument class and proceed like this:

$dom = new DOMDocument;
$dom->loadHTML($html); // pass your HTML source here
foreach ($dom->getElementsByTagName('a') as $tag) {
    echo $tag->getAttribute('href'); // the URL
    echo $tag->nodeValue;            // the text between the <a> tags
}
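
If DOM really is not an option, as the note in the question asks, a regex can probably cope with markup this regular. Below is a minimal sketch, not a robust parser: it assumes every href is double-quoted and that anchors are never nested. The sample $html and the $links array are made up for illustration.

```php
<?php
// Sample markup shaped like the question's; real pages may differ.
$html = <<<'HTML'
<div style="position:absolute; left:220px; top:785px">
<a href="http://link.domain.com/?id=123" target="_parent">
    <FONT style="font-size:10pt" color=#000000 face="Tahoma">
        <DIV><B>Harjaiyaan</B> - Nandini Srikar</DIV>
    </FONT>
</a>
</DIV>
HTML;

// Group 1 captures the URL, group 2 the markup between <a ...> and </a>.
// Flags: i = case-insensitive tag names, s = let "." match newlines.
$pattern = '~<a\s[^>]*href="([^"]*)"[^>]*>(.*?)</a>~is';

$links = [];
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // Remove the nested tags, then collapse runs of whitespace.
        $text = trim(preg_replace('/\s+/', ' ', strip_tags($m[2])));
        $links[] = ['url' => html_entity_decode($m[1]), 'text' => $text];
    }
}

print_r($links);
```

This breaks as soon as an href uses single quotes or no quotes at all, which is why DOMDocument remains the safer choice for markup you do not control.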

Upvotes: 2

Grant

Reputation: 2441

Expanding on @Shankar Damodaran's answer:

$html = file_get_contents('source.htm');

$dom = new DOMDocument;
$dom->loadHTML($html); 
foreach ($dom->getElementsByTagName('a') as $tag) {

    if (strstr($tag->getAttribute('href'),'?id=') !== false) {
        echo $tag->getAttribute('href') . "<br>\n";
    }

}

Then do the same for the MP3 links:

$html = file_get_contents('source.htm');

$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $tag) {
    $href = $tag->getAttribute('href');

    // Keep only URLs that end with the .mp3 extension
    if (substr($href, -4) === '.mp3') {
        echo $href . "<br>\n";
    }
}
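
Both filters can also run in a single pass that keeps the link text, which the question asks for too. This is only a sketch under a few assumptions: the sample $html stands in for the real page, the $found array and its 'id'/'mp3' keys are hypothetical names, and an MP3 link is taken to be any URL whose path ends in .mp3, ignoring case and any query string.

```php
<?php
// Sample markup shaped like the question's; a real page would come from file_get_contents().
$html = <<<'HTML'
<div><a href="http://link.domain.com/?id=123" target="_parent"><FONT><DIV><B>Harjaiyaan</B> - Nandini Srikar</DIV></FONT></a></div>
<div><a href="http://domain.com/song-title.mp3" target="_parent"><FONT><DIV><B>Song</B> - Artist</DIV></FONT></a></div>
<div><a href="http://domain.com/about.html">About</a></div>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$found = ['id' => [], 'mp3' => []];
foreach ($dom->getElementsByTagName('a') as $tag) {
    $href = $tag->getAttribute('href');
    // textContent joins the text of all nested tags; collapse the whitespace.
    $text = trim(preg_replace('/\s+/', ' ', $tag->textContent));
    // Check the extension on the path only, so a query string does not hide it.
    $path = (string) parse_url($href, PHP_URL_PATH);

    if (strpos($href, '?id=') !== false) {
        $found['id'][] = ['url' => $href, 'text' => $text];
    } elseif (strtolower(substr($path, -4)) === '.mp3') {
        $found['mp3'][] = ['url' => $href, 'text' => $text];
    }
}

print_r($found);
```

Parsing once avoids loading and walking the same file twice, and links that match neither pattern (the about.html one here) are simply skipped.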

Upvotes: 1
