Graham

Reputation: 129

Cannot get just the second list on a page with PHP Simple DOM

I have this code to try and extract a list on a page:

$websiteURL = "https://waset.org/conferences-in-january-2022-in-tokyo";
$html = file_get_html($websiteURL);

foreach ( $html->find( 'ul') as $ul ) {
     foreach($ul->find('li') as $li) {
        echo "LI: " . $li . "<br>";
    }
}

This does what I expect it to do, that is, display every <li> for ALL <ul> elements on the page.

However, if I replace the first foreach with the following (as I only want to get the second list):

foreach ( $html->find( 'ul', 1) as $ul ) {

I get:

"Call to a member function find() on int"

... which suggests that find('ul', 1) did not return anything, but I don't know why.

Note: There are more than two lists on this page.

Anybody know what I am doing wrong?

Upvotes: 0

Views: 58

Answers (1)

miken32

Reputation: 42714

To answer your question "I suppose my bottom line question is how do I access all the <li>'s from the second <ul> on a web page?", here is an approach using an API that is relatively modern, well-supported, and built into PHP:

<?php
$url = "https://waset.org/conferences-in-january-2022-in-tokyo";

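// collect parser warnings internally instead of printing them for malformed HTML
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTMLFile($url);
$lists = $dom->getElementsByTagName("ul");
// index 1 is the second <ul> in document order
$items = $lists[1]->getElementsByTagName("li");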
foreach ($items as $item) {
    // clean up extra whitespace
    $text = preg_replace("/\s+/", " ", trim($item->textContent));
    echo "$text\n------\n";
}

Output:

ICA 2022: Aeroponics Conference, Tokyo (Jan 07-08, 2022)
------
ICAA 2022: Agroforestry and Applications Conference, Tokyo (Jan 07-08, 2022)
------
ICAAAA 2022: Applied Aerodynamics, Aeronautics and Astronautics Conference, Tokyo (Jan 07-08, 2022)
------
ICAAAE 2022: Aquatic Animals and Aquaculture Engineering Conference, Tokyo (Jan 07-08, 2022)
------
ICAAC 2022: Advances in Astronomical Computing Conference, Tokyo (Jan 07-08, 2022)
------
...
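
For trying this without fetching the live page, here is a minimal sketch that runs the same lookup against an inline HTML string. The markup is a made-up stand-in, not the real page, and the null check on item(1) is an addition that is not in the code above; it covers the case where the document has fewer than two lists.

```php
<?php
// Hypothetical stand-in markup; the real page is not reproduced here.
$html = <<<HTML
<html><body>
<ul><li>Home</li><li>About</li></ul>
<ul><li>First conference</li><li>Second   conference</li></ul>
<ul><li>Footer link</li></ul>
</body></html>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);

// item(1) returns null when there is no second <ul>
$second = $dom->getElementsByTagName("ul")->item(1);

$texts = [];
if ($second !== null) {
    foreach ($second->getElementsByTagName("li") as $item) {
        // collapse runs of whitespace, as in the code above
        $texts[] = preg_replace("/\s+/", " ", trim($item->textContent));
    }
}

print_r($texts);
```

Keep in mind that getElementsByTagName("ul") returns every <ul> in document order, nested ones included, so on a real page "the second list" by index may not be the second list you see visually.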

It is also worth noting that the conference name is in an <a> element, the location is in a <span> within that <a>, and the date follows the <a>. Using this, you could fairly simply extract the data:

// return only the text that sits directly inside $node, skipping child elements
function getNodeText(\DOMNode $node): string
{
    $return = "";
    foreach ($node->childNodes as $child) {
        if ($child->nodeName === "#text") {
            $return .= trim($child->nodeValue);
        }
    }
    return $return;
}

foreach ($items as $item) {
    $conference = getNodeText($item->getElementsByTagName("a")[0]);
    $location = getNodeText($item->getElementsByTagName("span")[0]);
    $date = getNodeText($item);
    echo "------\n$conference | $location | $date\n";
}

Output:

------
ICA 2022: Aeroponics Conference, | Tokyo | (Jan 07-08, 2022)
------
ICAA 2022: Agroforestry and Applications Conference, | Tokyo | (Jan 07-08, 2022)
------
ICAAAA 2022: Applied Aerodynamics, Aeronautics and Astronautics Conference, | Tokyo | (Jan 07-08, 2022)
------
ICAAAE 2022: Aquatic Animals and Aquaculture Engineering Conference, | Tokyo | (Jan 07-08, 2022)
------
ICAAC 2022: Advances in Astronomical Computing Conference, | Tokyo | (Jan 07-08, 2022)
...
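
To see why this splits the fields cleanly, here is a small self-contained sketch that applies getNodeText to a single inline item. The markup is an assumption reconstructed from the structure described above (an <a> containing the name and a <span>, followed by the date), not a copy of the real page, and the href value is a placeholder.

```php
<?php
// Assumed markup for one list item, based on the description above.
$html = '<ul><li><a href="#">ICA 2022: Aeroponics Conference, '
      . '<span>Tokyo</span></a> (Jan 07-08, 2022)</li></ul>';

// Same helper as above: concatenates only the direct text children.
function getNodeText(DOMNode $node): string
{
    $return = "";
    foreach ($node->childNodes as $child) {
        if ($child->nodeName === "#text") {
            $return .= trim($child->nodeValue);
        }
    }
    return $return;
}

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);

$item = $dom->getElementsByTagName("li")->item(0);
$link = $item->getElementsByTagName("a")->item(0);
$span = $item->getElementsByTagName("span")->item(0);

// The <span> text is not a direct child of <a>, and the <a> text is not
// a direct child of <li>, so each call returns only its own field.
$line = getNodeText($link) . " | " . getNodeText($span) . " | " . getNodeText($item);
echo $line, "\n";
```

Because the helper ignores element children, the location does not leak into the conference name and the name does not leak into the date. On the real page it would be worth checking that item(0) is not null before passing it to getNodeText, since an item without an <a> or <span> would otherwise cause a type error.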

Upvotes: 1
