Reputation: 129
I have this code to try and extract a list on a page:
$websiteURL = "https://waset.org/conferences-in-january-2022-in-tokyo";
$html = file_get_html($websiteURL);
foreach ($html->find('ul') as $ul) {
    foreach ($ul->find('li') as $li) {
        echo "LI: " . $li . "<br>";
    }
}
This does what I expect it to do (that is, display every <li> for ALL <ul>'s on the page).
However, if I replace the first foreach with (as I only want to get the first list):
foreach ( $html->find( 'ul', 1) as $ul ) {
I get:
"Call to a member function find() on int"
... which suggests that find('ul', 1) did not return anything, but I don't know why?
Note: There are more than two lists on this page.
Anybody know what I am doing wrong?
Upvotes: 0
Views: 58
Reputation: 42714
To answer your question "I suppose my bottom line question is how do I access all the <li>'s from the second <ul> on a web page?" using an API that is relatively modern, well-supported, and built into PHP:
<?php
$url = "https://waset.org/conferences-in-january-2022-in-tokyo";
// collect HTML parse warnings internally instead of printing them
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTMLFile($url);
$lists = $dom->getElementsByTagName("ul");
// index 1 is the second <ul> on the page (indexes start at 0)
$items = $lists[1]->getElementsByTagName("li");
foreach ($items as $item) {
    // clean up extra whitespace
    $text = preg_replace("/\s+/", " ", trim($item->textContent));
    echo "$text\n------\n";
}
Output:
ICA 2022: Aeroponics Conference, Tokyo (Jan 07-08, 2022)
------
ICAA 2022: Agroforestry and Applications Conference, Tokyo (Jan 07-08, 2022)
------
ICAAAA 2022: Applied Aerodynamics, Aeronautics and Astronautics Conference, Tokyo (Jan 07-08, 2022)
------
ICAAAE 2022: Aquatic Animals and Aquaculture Engineering Conference, Tokyo (Jan 07-08, 2022)
------
ICAAC 2022: Advances in Astronomical Computing Conference, Tokyo (Jan 07-08, 2022)
------
...
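If you want to try this without hitting the live site, here is a minimal sketch that runs the same lookup against an inline HTML string. The markup and the helper name listItemsAt are made up for illustration. It uses item() rather than array syntax so that a missing list can be detected instead of causing a fatal error.

```php
<?php
// Hypothetical helper: text of every <li> inside the $index-th <ul> (0-based).
function listItemsAt(string $html, int $index): array
{
    libxml_use_internal_errors(true); // tolerate sloppy real-world markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);

    // item() returns null when the index is out of range
    $list = $dom->getElementsByTagName("ul")->item($index);
    if ($list === null) {
        return [];
    }

    $texts = [];
    foreach ($list->getElementsByTagName("li") as $item) {
        // collapse runs of whitespace, as in the code above
        $texts[] = preg_replace("/\s+/", " ", trim($item->textContent));
    }
    return $texts;
}

// Made-up markup standing in for the real page
$html = "<ul><li>menu</li></ul><ul><li>ICA 2022</li><li>ICAA   2022</li></ul>";
print_r(listItemsAt($html, 1));
```

Asking for an index that does not exist, for example listItemsAt($html, 5), gives back an empty array rather than an error.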
Also worth noting that the conference name is in an <a> element, the location is in a <span> within it, and the date follows it. Using this, you could fairly simply extract the data:
// concatenate only the node's own text, ignoring text inside child elements
function getNodeText(\DOMNode $node): string
{
    $return = "";
    foreach ($node->childNodes as $child) {
        if ($child->nodeName === "#text") {
            $return .= trim($child->nodeValue);
        }
    }
    return $return;
}

foreach ($items as $item) {
    $conference = getNodeText($item->getElementsByTagName("a")[0]);
    $location = getNodeText($item->getElementsByTagName("span")[0]);
    $date = getNodeText($item);
    echo "------\n$conference | $location | $date\n";
}
Output:
------
ICA 2022: Aeroponics Conference, | Tokyo | (Jan 07-08, 2022)
------
ICAA 2022: Agroforestry and Applications Conference, | Tokyo | (Jan 07-08, 2022)
------
ICAAAA 2022: Applied Aerodynamics, Aeronautics and Astronautics Conference, | Tokyo | (Jan 07-08, 2022)
------
ICAAAE 2022: Aquatic Animals and Aquaculture Engineering Conference, | Tokyo | (Jan 07-08, 2022)
------
ICAAC 2022: Advances in Astronomical Computing Conference, | Tokyo | (Jan 07-08, 2022)
...
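As for the original error: in simple_html_dom, find() returns an array only when it is called without an index. With an index, as in find('ul', 1), it returns a single element (or null if nothing matches). Putting that single element in a foreach makes PHP walk over the object's public properties, and one of those is presumably an integer, which would explain "Call to a member function find() on int". The sketch below reproduces the effect with a stand-in class; FakeNode and its properties are hypothetical, not the real simple_html_dom class.

```php
<?php
// Hypothetical stand-in for a parsed element; not part of simple_html_dom.
class FakeNode
{
    public $nodetype = 1;
    public $tag = "ul";
}

$node = new FakeNode();

// foreach over an object visits its public properties, not child elements
$seen = [];
foreach ($node as $name => $value) {
    $seen[$name] = gettype($value);
}
print_r($seen);
```

So when you pass an index, drop the outer loop and call find('li') directly on the element that find('ul', 1) returns, after checking that it is not null.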
Upvotes: 1