Reputation: 175
I know how to use XPath to scrape and echo text from another website by targeting elements through attributes such as a div id or class, using the code below. But I don't know how to do it under more precise conditions, for example when the text I want to scrape and echo has no unique identifier such as a div id. The code below prints the scraped data.
$doc = new DOMDocument;
// We don't want to bother with white spaces
$doc->preserveWhiteSpace = false;
// Most HTML Developers are chimps and produce invalid markup...
$doc->strictErrorChecking = false;
$doc->recover = true;
$doc->loadHTMLFile('http://www.nbcnews.com/business');
$xpath = new DOMXPath($doc);
$query = "//div[@class='market']";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
    echo trim($entry->textContent); // use `trim` to eliminate spaces
}
In the example source below, I want to pull the value "21,271.97". But there's no unique tag for it, no div id. Is it possible to pull this data by identifying a keyword in the <p> that never changes, for example "DJIA all time"?
<p>DJIA All Time, Record-High Close: <font color="#0000FF">June 9,
2017</font>
(<font color="#FF0000"><b bgcolor="#FFFFCC"><font face="Verdana, Arial,
Helvetica, sans-serif" size="2">21,271.97</font></b></font>)</p>
I'm wondering if I could replace $query = "//div[@class='market']"; with something along the lines of $query = "//p['DJIA all time']";
Could this be possible?
I also wonder if a loop with something like $query = "//p[='DJIA']"; could work, though I don't know exactly how to use that. Thanks!!
Upvotes: 0
Views: 170
Reputation: 52685
Try the XPath expression below:
//p[contains(text(), "DJIA All Time")]//b/font
Considering the provided link (http://www.nbcnews.com/business), you can get the required text with
//span[text()="DJIA"]/following-sibling::span[@class="market_item market_price"]
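The first expression can be checked offline against the snippet from the question. What follows is only a minimal sketch: the HTML is pasted in as a string instead of being fetched, and firstMatchText() is a hypothetical helper name introduced for illustration, not something from the original posts.

```php
<?php
// Hypothetical helper (not from the original posts): returns the trimmed text
// of the first node matched by $expr, or null when nothing matches.
function firstMatchText(string $html, string $expr): ?string
{
    libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    $doc = new DOMDocument;
    $doc->loadHTML($html);
    libxml_clear_errors();

    $nodes = (new DOMXPath($doc))->query($expr);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// The snippet from the question, embedded so the example runs without network access.
$html = <<<'HTML'
<p>DJIA All Time, Record-High Close: <font color="#0000FF">June 9,
2017</font>
(<font color="#FF0000"><b bgcolor="#FFFFCC"><font face="Verdana, Arial,
Helvetica, sans-serif" size="2">21,271.97</font></b></font>)</p>
HTML;

// Select the <p> whose first text node contains the keyword, then the <font> inside <b>.
$recordClose = firstMatchText($html, '//p[contains(text(), "DJIA All Time")]//b/font');
echo $recordClose, PHP_EOL; // prints 21,271.97
```

Note that contains() is case-sensitive, so "DJIA all time" would not match this paragraph. Also, a predicate that is just a string literal, as in //p['DJIA all time'], is always true for a non-empty string, so it would select every <p> on the page.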
Upvotes: 1
Reputation: 57131
It would be good to have a play with an online XPath tester - I use https://www.freeformatter.com/xpath-tester.html#ad-output
$query = "//p[contains(text(),'DJIA')]";
Although, on the page you're after, the value seems to be the first result for:
$query = "//span[contains(@class,'market_price')]";
But the idea is the same in both cases: contains(source, value) is true when source contains value as a substring, so the predicate selects every node for which that holds. In the first case the source is the node's text(); in the second it is the class attribute.
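To see how both uses of contains() behave, here is a rough sketch run against hypothetical markup. The class names are borrowed from the expressions quoted in this thread, but the structure and the prices (100.00, 200.00) are made-up placeholders, and queryTexts() is an illustrative helper name, so treat it as an assumption about the page and not a copy of it.

```php
<?php
// Hypothetical helper: runs an XPath expression and returns the trimmed text
// of every matched node, in document order.
function queryTexts(DOMXPath $xpath, string $expr): array
{
    $texts = [];
    foreach ($xpath->query($expr) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Made-up markup; only the class names come from the thread.
$html = <<<'HTML'
<div class="market">
  <span>DJIA</span>
  <span class="market_item market_price">100.00</span>
</div>
<div class="market">
  <span>NASDAQ</span>
  <span class="market_item market_price">200.00</span>
</div>
HTML;

libxml_use_internal_errors(true);   // keep parser warnings quiet
$doc = new DOMDocument;
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// contains() on an attribute: every span whose class mentions market_price.
$prices = queryTexts($xpath, "//span[contains(@class,'market_price')]");

// contains() on text(): every span whose text includes the keyword.
$labels = queryTexts($xpath, "//span[contains(text(),'DJIA')]");

// Combining both: the price that follows the DJIA label specifically.
$djia = queryTexts(
    $xpath,
    "//span[text()='DJIA']/following-sibling::span[contains(@class,'market_price')]"
);

print_r($prices); // two entries: 100.00 and 200.00
print_r($labels); // one entry: DJIA
print_r($djia);   // one entry: 100.00
```

Relying on "the first record" of the class query works only while the DJIA entry stays first on the page; anchoring on the label text is a little more robust if the order changes.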
Upvotes: 1