Rtra

Reputation: 522

PHP DOM: extract links from a specific table

I want to extract all links and their text from my old website. I have this working, but in some places I used ol>li tags and in others ul>li tags inside the table, and I have about 400 different pages. I can extract all the links, but I have to switch between ol and ul every time. The easiest and most time-saving approach would be to target the specific <table> that contains the links. However, when I select by <table>, the links from all the other tables are extracted as well, which I don't want.

Structure of the target table, which contains ol>li or ul>li tags

<table style="width:850px;" cellspacing="0" cellpadding="1" border="3">
    <tbody>
        <tr>
            <td style="text-align: center; background-color: rgb(51, 51, 204);">
                <h1>My Links</h1>
            </td>
        </tr>
        <tr>
            <td>
                <ol>
                    <li><a href="http://websitelink.com/page1.php">Page 1</a></li>
                    <li><a href="http://websitelink.com/page2.php">Page 2</a></li>
                    <li><a href="http://websitelink.com/page3.php">Page 3</a></li>
                    <li><a href="http://websitelink.com/page4.php">Page 4</a></li>
                </ol>
                ...
                <ul>
                    <li><a href="http://websitelink.com/a.php">Page A</a></li>
                    <li><a href="http://websitelink.com/b.php">Page B</a></li>
                    <li><a href="http://websitelink.com/c.php">Page C</a></li>
                    <li><a href="http://websitelink.com/d.php">Page D</a></li>
                </ul>
            </td>
        </tr>
    </tbody>
</table>

My Current PHP Code

$html = file_get_contents('http://mywebsitelink.com/pagename.html');
$dom = new DOMDocument;
@$dom->loadHTML($html);
$oltags = $dom->getElementsByTagName('ol'); // I have to switch between 'ol' and 'ul' here; I would rather select the table instead

foreach ($oltags as $list) {
    $links = $list->getElementsByTagName('a');
    foreach ($links as $link) {
        $text = $link->nodeValue;
        $href = $link->getAttribute('href');
        if (!empty($text) && !empty($href)) {
            echo "Link Title:     " . $text . "       Location:     " . $href . "<br />";
        }
    }
}

Upvotes: 0

Views: 597

Answers (2)

deg

Reputation: 445

$html = file_get_contents('http://mywebsitelink.com/pagename.html');
$dom = new DOMDocument;
@$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

// The query already selects the <a> elements themselves, so iterate over
// them directly; calling getElementsByTagName('a') on an <a> finds nothing.
$thetags = $xpath->query('//table/tbody/tr/td/ol/li/a|//table/tbody/tr/td/ul/li/a');

foreach ($thetags as $onelink) {
    $text = $onelink->nodeValue;
    $href = $onelink->getAttribute('href');
    if (!empty($text) && !empty($href)) {
        echo "Link Title:     " . $text . "       Location:     " . $href . "<br />";
    }
}
[...]

Upvotes: 0

Sahil Gulati

Reputation: 15141

You can try this. It loads the page into a DOMDocument and runs a DOMXPath query for the anchors inside the li elements.

The XPath query //table/tbody/tr/td/ol/li/a|//table/tbody/tr/td/ul/li/a uses the | (union) operator, so it matches anchors under either //table/tbody/tr/td/ol/li or //table/tbody/tr/td/ul/li.
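To make the effect of the | operator concrete, here is a minimal sketch run against a small inline document. The sample markup and the variable names are assumptions made for illustration only. It also shows what should be an equivalent single path that uses a predicate on the list element instead of a union.

```php
<?php
// Hypothetical sample markup standing in for the real page.
$html = '<table><tbody><tr><td>'
      . '<ol><li><a href="page1.php">Page 1</a></li></ol>'
      . '<ul><li><a href="a.php">Page A</a></li></ul>'
      . '</td></tr></tbody></table>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Small helper that collects the href attribute of every matched anchor.
$hrefs = function (DOMNodeList $nodes): array {
    $out = [];
    foreach ($nodes as $a) {
        $out[] = $a->getAttribute('href');
    }
    return $out;
};

// Union of two paths: anchors under ol/li plus anchors under ul/li.
$unionHrefs = $hrefs($xpath->query(
    '//table/tbody/tr/td/ol/li/a|//table/tbody/tr/td/ul/li/a'
));

// Single path with a predicate that accepts either list type.
$predicateHrefs = $hrefs($xpath->query(
    '//table/tbody/tr/td/*[self::ol or self::ul]/li/a'
));

print_r($unionHrefs);
print_r($predicateHrefs);
```

Both queries return the anchors in document order, so neither requires switching between ol and ul by hand.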

Try this code snippet here

$links = array();
$domDocument = new DOMDocument();
$domDocument->loadHTML($string); // $string holds the HTML of the page

$domXPath = new DOMXPath($domDocument);
$results = $domXPath->query("//table/tbody/tr/td/ol/li/a|//table/tbody/tr/td/ul/li/a"); //querying domdocument
foreach($results as $result)
{
    $links[]=$result->getAttribute("href");//gathering href attribute
}
print_r($links);
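The query above still matches every table on the page. One possible way to restrict it to the table from the question is to select the table by its heading. The following is only a sketch under several assumptions: the target table is the only one containing an h1 that reads "My Links", tables are not nested, and the heading text contains no double quote. The function name extractLinksFromTable and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: returns text and href of every list link found in
// the table whose <h1> matches $heading (assumed to contain no double quote).
function extractLinksFromTable(string $html, string $heading): array
{
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // "//li/a" below the table matches both ol>li and ul>li,
    // with or without an intermediate <tbody>.
    $query = sprintf('//table[.//h1[normalize-space(.) = "%s"]]//li/a', $heading);

    $links = [];
    foreach ($xpath->query($query) as $a) {
        $text = trim($a->nodeValue);
        $href = $a->getAttribute('href');
        if ($text !== '' && $href !== '') {
            $links[] = ['text' => $text, 'href' => $href];
        }
    }
    return $links;
}

// Sample markup (an assumption): one target table and one unrelated table.
$sample = '<table><tbody>'
        . '<tr><td><h1>My Links</h1></td></tr>'
        . '<tr><td>'
        . '<ol><li><a href="http://websitelink.com/page1.php">Page 1</a></li></ol>'
        . '<ul><li><a href="http://websitelink.com/a.php">Page A</a></li></ul>'
        . '</td></tr></tbody></table>'
        . '<table><tbody><tr><td>'
        . '<ul><li><a href="http://websitelink.com/other.php">Other</a></li></ul>'
        . '</td></tr></tbody></table>';

$links = extractLinksFromTable($sample, 'My Links');
foreach ($links as $link) {
    echo "Link Title:     " . $link['text'] . "       Location:     " . $link['href'] . "<br />";
}
```

If the real pages identify the table differently (for example by an id or by its position), the predicate inside the square brackets would need to change accordingly.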

Upvotes: 1
