Reputation: 1553
I'm using a crawler to retrieve the HTML content of certain pages on the web. I currently have the entire HTML stored in a single PHP variable:
$string = "<PRE>".htmlspecialchars($crawler->results)."</PRE>\n";
What I want to do is select all "p" tags (for example) and store their contents in an array. What is the proper way to do that?
I've tried the following using XPath, but it doesn't show anything (most probably because the document itself isn't XML; I just copy-pasted the example from the documentation).
$xml = new SimpleXMLElement($string);
$result = $xml->xpath('/p');
while (list(, $node) = each($result)) {
    echo '/p: ', $node, "\n";
}
Hopefully someone with (a lot) more experience in PHP will be able to help me out :D
Upvotes: 1
Views: 2832
Reputation: 6389
Check out Simple HTML DOM. It can fetch external pages and parse them with reasonably good accuracy.
http://simplehtmldom.sourceforge.net/
It can be used like this:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}
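Applied to the question, a minimal sketch (this assumes you parse the raw crawler output, not the htmlspecialchars-escaped string; str_get_html() is the library's string-parsing counterpart to file_get_html()):

// Parse the crawler's raw HTML string instead of fetching a URL
$html = str_get_html($crawler->results);

// Collect the text of every <p> tag into an array
$paragraphs = array();
foreach ($html->find('p') as $element) {
    $paragraphs[] = $element->plaintext;
}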
Upvotes: 1
Reputation: 1549
Try using DOMDocument along with DOMDocument::getElementsByTagName. The workflow should be quite simple. Something like:
$doc = new DOMDocument();
$doc->loadHTML($crawler->results); // parse the raw HTML, not the htmlspecialchars-escaped string
$pNodes = $doc->getElementsByTagName('p');
Which will return a DOMNodeList.
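To get the contents into an array, as the question asks, one way (a minimal sketch) is to loop over the list:

// Collect each paragraph's text content into an array
$paragraphs = array();
foreach ($pNodes as $node) {
    $paragraphs[] = $node->textContent;
}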
Upvotes: 3
Reputation: 61
I vote for using a regexp. For the p tag:
preg_match_all('/<p>(.*?)<\/p>/', '<p>foo</p><p>foo 1</p><p>foo 2</p>', $arr, PREG_PATTERN_ORDER);

// $arr[1] holds the text captured by (.*?) for each match
foreach($arr[1] as $value)
{
    echo $value."<br>";
}
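Note that this pattern only matches bare <p> tags on a single line; if the crawled pages use attributes or multi-line paragraphs, it would need loosening, something like this sketch run against the raw crawler output:

preg_match_all('/<p[^>]*>(.*?)<\/p>/s', $crawler->results, $arr, PREG_PATTERN_ORDER);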
Upvotes: 2