Reputation: 53
There is a web page accessible to me via an intranet URL that I have no access to edit. It contains various span elements with text in them that I want to capture to use elsewhere. Each span element I want has a unique id, so I would like to use this id to identify and capture the text. I'm trying to use PHP's DOMDocument to do this.
Here is an example of the HTML from the URL.
<td class="style12">
<div id="upINMain">
<span id="car7">90</span>
</div>
</td>
Note: if I visit the url in a browser, I can see it's a full HTML document, the above is just a snippet.
Here is some of the PHP code I'm trying to use to grab the various values.
// scrape the page to pull data.
$page = file_get_contents([full url I have pulled from database here including http bit etc]);
$doc = new DOMDocument();
$doc->validateOnParse = true;
$doc->preserveWhiteSpace = false;
$doc->loadHTML($page);
// define id attributes
foreach ($doc->getElementsByTagName('span') as $element) {
    $element->setIdAttribute('id', true);
}
// now work out from the table which ids we need to scrape and how many.
$Column1Name = $ReadIDMapsRow['column1_name'];
$Column1Value = $doc->getElementById($ReadIDMapsRow['column1_id']);
$Column1ValueText = $Column1Value->textContent;
(In the above code, $ReadIDMapsRow['column1_id'] contains the id of the element I'm trying to capture, a string 'car7'.)
But when I look at the get_defined_vars() debug printout on the output page I'm putting all this into, I can see that the var $Column1ValueText is empty (along with any others I'm getting the same way).
[Column1Name] => CAR
[Column1Value] =>
[Column1ValueText] =>
It might be relevant that when I look at my debug info, the $doc entry says
[doc] => DOMDocument Object
(
[doctype] => (object value omitted) <- this is a lie, it does have a doc type!
[implementation] => (object value omitted)
[documentElement] => (object value omitted)
[actualEncoding] =>
[encoding] =>
[xmlEncoding] =>
[standalone] => 1
But if I inspect the page in Chrome, it does have a doctype declaration at the top, and it's not just Chrome being generous and adding it, because I can also see it in the $page var in my debug output:
[page] =>
<!DOCTYPE html>
...
Edit for Nigel: The actual code block for capturing the different values I want looks like this.
// define id attributes
foreach ($doc->getElementsByTagName('span') as $element) {
    $element->setIdAttribute('id', true);
}
// now work out from the table which ids we need to scrape and how many.
if (!empty($ReadIDMapsRow['column1_name'])) {
    $Column1Name = $ReadIDMapsRow['column1_name'];
    $Column1Value = $doc->getElementById($ReadIDMapsRow['column1_id']);
    $Column1ValueText = $Column1Value->textContent;
}
if (!empty($ReadIDMapsRow['column2_name'])) {
    $Column2Name = $ReadIDMapsRow['column2_name'];
    $Column2Value = $doc->getElementById($ReadIDMapsRow['column2_id']);
    $Column2ValueText = $Column2Value->textContent;
}
if (!empty($ReadIDMapsRow['column3_name'])) {
    $Column3Name = $ReadIDMapsRow['column3_name'];
    $Column3Value = $doc->getElementById($ReadIDMapsRow['column3_id']);
    $Column3ValueText = $Column3Value->textContent;
}
etc... 10 of these blocks of code in total.
It pulls from a row in a database, and its purpose is to use this row to decide the URL, how many element ids to look for on the HTML page, and what those ids are. (The idea is that I can just edit or add a row in this table to make it look for different things on different pages.)
Upvotes: 0
Views: 54
Reputation: 57121
This is what I have got to work from your code so far...
$doc = new DOMDocument();
$doc->validateOnParse = true;
$doc->preserveWhiteSpace = false;
$doc->loadHTML($page);
$ReadIDMapsRow = ['column1_name' => 'CAR', 'column1_id' => 'car7'];
$Column1Name = $ReadIDMapsRow['column1_name'];
$Column1Value = $doc->getElementById($ReadIDMapsRow['column1_id']);
$Column1ValueText = $Column1Value->textContent;
echo $Column1Name.PHP_EOL;
echo $Column1ValueText.PHP_EOL;
which gives...
CAR
90
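If it helps, here is a rough sketch of how the ten repeated blocks could be collapsed into one loop that also copes with an id that is not on the page. getElementById() returns null when nothing matches, and reading ->textContent on null gives a notice or warning (depending on PHP version) and an empty value. Everything not in your post is an assumption for illustration: the inline $page string stands in for your file_get_contents() result, textById() is a made-up helper name, and the 'VAN'/'van3' entry is an invented row that deliberately has no matching element.

```php
<?php
// Hypothetical stand-in for the HTML returned by file_get_contents().
$page = '<!DOCTYPE html><html><body><table><tr><td class="style12">'
      . '<div id="upINMain"><span id="car7">90</span></div>'
      . '</td></tr></table></body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

// Same settings as the snippet above.
$doc = new DOMDocument();
$doc->validateOnParse = true;
$doc->preserveWhiteSpace = false;
$doc->loadHTML($page);
libxml_clear_errors();

// Hypothetical helper: trimmed text of the element with this id,
// or null when no element with that id exists.
function textById(DOMDocument $doc, string $id): ?string
{
    $node = $doc->getElementById($id);
    return $node === null ? null : trim($node->textContent);
}

// Assumed shape of the database row; column2 is invented and has no
// matching element in $page.
$ReadIDMapsRow = [
    'column1_name' => 'CAR', 'column1_id' => 'car7',
    'column2_name' => 'VAN', 'column2_id' => 'van3',
];

// Walk column1..column10, skipping any slot the row does not fill in.
$values = [];
for ($i = 1; $i <= 10; $i++) {
    $name = $ReadIDMapsRow["column{$i}_name"] ?? '';
    $id   = $ReadIDMapsRow["column{$i}_id"] ?? '';
    if ($name === '' || $id === '') {
        continue;
    }
    $values[$name] = textById($doc, $id);
}

var_dump($values);
```

A null in $values then tells you that the id was not found in the parsed document, which is easier to spot in a debug dump than an empty string.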
Upvotes: 1