UpIX

Reputation: 493

Append <li> inner text to PHP URL scraper results

I have a list of links on one page:

<li><span><a href="https://site1.com">site1.com</a> : Description 1</span></li>
<li><span><a href="https://site2.com">site2.com</a> : Description 2</span></li>
<li><span><a href="https://site3.com">site3.com</a> : Description 3</span></li>
<li><span><a href="https://site4.com">site4.com</a> : Description 4</span></li>

I'm using PHP to take the links from that page and display them on another, like this:

<?php
$urlContent = file_get_contents('https://www.example.com/');

$dom = new DOMDocument();
@$dom->loadHTML($urlContent);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");

for($i = 0; $i < $hrefs->length; $i++){
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    $url = filter_var($url, FILTER_SANITIZE_URL);
    if(!filter_var($url, FILTER_VALIDATE_URL) === false){
        echo '<a href="'.$url.'">'.$url.'</a><br />';
    }
}
?>

However, I can't figure out how to include the description next to each link. Here is one of my many attempts:

<?php
$urlContent = file_get_contents('https://www.example.com');

$dom = new DOMDocument();
@$dom->loadHTML($urlContent);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a/li");
$li = document.getElementsByTagName("li");

for($i = 0; $i < $hrefs->length; $i++){
    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    $url = filter_var($url, FILTER_SANITIZE_URL);
    if(!filter_var($url, FILTER_VALIDATE_URL) === false){
        echo '<a href="'.$url.'">'.$url.'</a> : '.$li.' <br />';
    }
}
?>

The first script works great, but everything I have tried to add the description has failed.

Upvotes: 0

Views: 74

Answers (1)

u_mulder

Reputation: 54841

Here's a simple example based on your current markup:

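// $urlContent holds the page HTML, e.g. from file_get_contents() as in the question
$dom = new DOMDocument();
@$dom->loadHTML($urlContent);
$xpath = new DOMXPath($dom);
// Select each <li>; this path assumes the items sit directly under <body>, as in the sample markup
$lis = $xpath->evaluate("/html/body/li");

foreach ($lis as $li) {
    // First <a> inside this item's <span>
    $a = $xpath->evaluate("span/a", $li)->item(0);
    if ($a === null) {
        continue; // no link in this item
    }
    $url = $a->getAttribute('href');
    // nextSibling is the text node after <a>, e.g. " : Description 1"
    var_dump($url, $a->nextSibling->nodeValue);
}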

Here, nextSibling is the text node that follows the <a> tag, so nextSibling->nodeValue will be " : Description 1". You'll have to remove the surrounding spaces and the colon, for example with trim and a character mask of ' :'.

Working fiddle.
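
To tie this back to the output format in the question, here is a minimal sketch that combines the nextSibling approach with trim and the original echo line. The helper name extractLinks, the inline sample markup, and the use of //li (so the items may sit inside a <ul>) are assumptions made for illustration, not part of the original answer.

```php
<?php
// Sample markup copied from the question; a real page would be fetched
// with file_get_contents() instead.
$html = <<<HTML
<ul>
<li><span><a href="https://site1.com">site1.com</a> : Description 1</span></li>
<li><span><a href="https://site2.com">site2.com</a> : Description 2</span></li>
</ul>
HTML;

/**
 * Hypothetical helper: returns url/description pairs for every <li>
 * that contains a link with a valid URL.
 */
function extractLinks(string $html): array
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // suppress warnings about imperfect HTML
    $xpath = new DOMXPath($dom);

    $links = [];
    foreach ($xpath->query('//li') as $li) {
        // First <a> with an href anywhere inside this list item
        $a = $xpath->query('.//a[@href]', $li)->item(0);
        if ($a === null) {
            continue;
        }
        $url = $a->getAttribute('href');
        if (filter_var($url, FILTER_VALIDATE_URL) === false) {
            continue;
        }
        // The text node right after <a> holds " : Description N";
        // strip the surrounding whitespace and the colon.
        $next = $a->nextSibling;
        $description = $next !== null ? trim($next->nodeValue, " :\n\r\t") : '';
        $links[] = ['url' => $url, 'description' => $description];
    }
    return $links;
}

foreach (extractLinks($html) as $link) {
    $url = htmlspecialchars($link['url']);
    echo '<a href="' . $url . '">' . $url . '</a> : '
        . htmlspecialchars($link['description']) . "<br />\n";
}
```

Escaping with htmlspecialchars before echoing is a precaution, since the scraped text comes from a page you don't control.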

Upvotes: 2
