Eugen

Reputation: 1553

How to extract only certain tags from an HTML document using PHP?

I'm using a crawler to retrieve the HTML content of certain pages on the web. I currently have the entire HTML stored in a single PHP variable:

$string = "<PRE>".htmlspecialchars($crawler->results)."</PRE>\n";

What I want to do is select all "p" tags (for example) and store their contents in an array. What is the proper way to do that?

I've tried the following, using XPath, but it doesn't show anything (most probably because the document itself isn't XML; I just copy-pasted the example given in the SimpleXMLElement documentation).

$xml = new SimpleXMLElement ($string);

    $result=$xml->xpath('/p');
    while(list( , $node)=each($result)){
        echo '/p: ' , $node, "\n"; 
    }

Hopefully someone with (a lot) more experience in PHP will be able to help me out :D

Upvotes: 1

Views: 2832

Answers (3)

Paul Dessert

Reputation: 6389

Check out Simple HTML Dom. It will grab external pages and process them with fairly accurate detail.

http://simplehtmldom.sourceforge.net/

It can be used like this:

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach($html->find('img') as $element)
   echo $element->src . '<br>';

Upvotes: 1

clexmond

Reputation: 1549

Try using DOMDocument along with DOMDocument::getElementsByTagName. The workflow should be quite simple. Something like:

$doc = new DOMDocument();
// Pass the raw HTML: htmlspecialchars() would escape the tags and leave nothing to parse.
$doc->loadHTML($crawler->results);
$pNodes = $doc->getElementsByTagName('p');

This returns a DOMNodeList, which you can iterate over with foreach.
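
To get from the node list to a plain array of paragraph texts, something along these lines should work. This is only a sketch: the function name extractParagraphs and the sample markup are made up for illustration, and it assumes you want the text of each paragraph rather than its inner HTML.

```php
<?php
// Hypothetical helper: returns the text of every <p> element found in $html.
function extractParagraphs(string $html): array
{
    $doc = new DOMDocument();
    // Crawled HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $paragraphs = [];
    foreach ($doc->getElementsByTagName('p') as $node) {
        // textContent includes the text of nested tags such as <b> or <a>.
        $paragraphs[] = trim($node->textContent);
    }
    return $paragraphs;
}

// Sample input, standing in for $crawler->results.
$html = '<html><body><p>foo</p><div><p class="x">foo <b>1</b></p></div></body></html>';
print_r(extractParagraphs($html));
```

Note that loadHTML() does not accept an empty string, so check the crawler result before passing it in.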

Upvotes: 3

autumncollection

Reputation: 61

I vote for using a regexp. For the p tag:

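// Non-greedy (.*?) so that each <p>...</p> pair is matched separately
preg_match_all('/<p>(.*?)<\/p>/', '<p>foo</p><p>foo 1</p><p>foo 2</p>', $arr, PREG_PATTERN_ORDER);
// $arr[0] holds the full matches, $arr[1] the text captured by the first group
foreach($arr[1] as $value)
{
  echo $value."<br>";
}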
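
Crawled pages usually have attributes on their tags and line breaks inside paragraphs, which a literal <p> pattern will not match. A slightly more tolerant variant might look like the sketch below; matchParagraphs is a made-up name, and nested or malformed markup can still trip up any regexp, so a DOM parser remains the more robust choice.

```php
<?php
// Hypothetical helper: regexp-based extraction of the contents of <p> tags.
function matchParagraphs(string $html): array
{
    // \b[^>]* allows attributes (and keeps <pre> from matching),
    // .*? is non-greedy, "i" ignores case, "s" lets . match newlines.
    preg_match_all('/<p\b[^>]*>(.*?)<\/p>/is', $html, $matches);
    // $matches[1] holds the captured inner HTML of each paragraph.
    return $matches[1];
}

print_r(matchParagraphs("<p>foo</p><P class=\"x\">foo\n1</P>"));
```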

Upvotes: 2
