Reputation: 597
<!-- This Div repeated in HTML with different properties value -->
<div style="position:absolute; overflow:hidden; left:220px; top:785px; width:347px; height:18px; z-index:36">
<!-- Only Unique Thing is This in few pages -->
<a href="http://link.domain.com/?id=123" target="_parent">
<!-- OR in some pages Only Unique Thing is This, ending with mp3 extension -->
<a href="http://domain.com/song-title.mp3" target="_parent">
<!-- This Div also repeated multiple in HTML -->
<FONT style="font-size:10pt" color=#000000 face="Tahoma">
<DIV><B>Harjaiyaan</B> - Nandini Srikar</DIV>
</FONT>
</a>
</DIV>
We have very dirty HTML markup; it is generated by some program or application. We want to extract the URLs from this code, as well as the link text.
The href attributes contain two types of URL. URL 1 pattern: 'http://link.domain.com/?id=123'. URL 2 pattern: 'http://domain.com/song-title.mp3'.
The first type follows a specific pattern, but the second has no pattern other than ending with the '.mp3' extension.
Is there a function to extract the URLs matching these patterns, and the text as well?
Note: without DOM, is there any way to match the href and the text between the tags with a regular expression, e.g. preg_match?
Upvotes: 1
Views: 322
Reputation: 68556
Make use of the DOMDocument class and proceed like this.
libxml_use_internal_errors(true); // suppress warnings from the malformed markup
$dom = new DOMDocument;
$dom->loadHTML($html); // pass your HTML source here
foreach ($dom->getElementsByTagName('a') as $tag) {
    echo $tag->getAttribute('href'), "\n"; // the URL
    echo $tag->nodeValue, "\n";            // the text between the tags
}
Upvotes: 2
Reputation: 2441
Expanding on @Shankar Damodaran's answer:
$html = file_get_contents('source.htm');
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $tag) {
    // keep only the links that carry an "?id=" query string
    if (strpos($tag->getAttribute('href'), '?id=') !== false) {
        echo $tag->getAttribute('href') . "<br>\n";
    }
}
Then do the same for the MP3 links:
$html = file_get_contents('source.htm');
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $tag) {
    // match only hrefs that end with ".mp3", as described in the question
    if (preg_match('/\.mp3$/i', $tag->getAttribute('href'))) {
        echo $tag->getAttribute('href') . "<br>\n";
    }
}
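Since the question also asks for the link text, the two loops can probably be folded into a single pass. Here is a rough sketch of that idea, not a definitive implementation: the function name extractLinks and the sample $html string are my own for illustration, and I am assuming the '.mp3' check should apply to the end of the href.

```php
<?php
// Hypothetical helper (name is illustrative): collects the href and link text
// of every anchor whose URL contains "?id=" or ends with ".mp3".
function extractLinks(string $html): array
{
    // Collect libxml warnings internally instead of printing them,
    // since the markup is known to be messy.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $tag) {
        $href = $tag->getAttribute('href');
        $isIdLink = strpos($href, '?id=') !== false;     // URL type 1
        $isMp3 = (bool) preg_match('/\.mp3$/i', $href);  // URL type 2
        if ($isIdLink || $isMp3) {
            $links[] = ['href' => $href, 'text' => trim($tag->textContent)];
        }
    }
    return $links;
}

// Sample input made up for this sketch, loosely based on the question's markup.
$html = '<div><a href="http://link.domain.com/?id=123"><b>Harjaiyaan</b> - Nandini Srikar</a></div>'
      . '<div><a href="http://domain.com/song-title.mp3">Song</a></div>'
      . '<div><a href="http://domain.com/about.html">About</a></div>';

foreach (extractLinks($html) as $link) {
    echo $link['href'] . ' => ' . $link['text'] . "\n";
}
```

Returning an array rather than echoing inside the loop keeps the filtering separate from the output, so the same result can be printed, stored, or filtered further.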
Upvotes: 1