Reputation: 1417

PHP DOMDocument Namespaces

I'm writing a script that takes a webpage and counts how many times things like a Facebook Like button are used. Since this would best be done with a DOM, I decided to use PHP's DOMDocument.

The one problem I have come across, though, is for elements like facebook's like button:

<fb:like send="true" width="450" show_faces="true"></fb:like>

Since this element technically has a namespace of "fb", DOMDocument throws a warning saying this namespace prefix is not defined. It then proceeds to strip off the prefix, so when I get to said element, its tag is no longer fb:like, but instead, like.

Is there any way to "pre-register" a namespace? Any suggestions?

Upvotes: 7

Answers (6)

BernieMaier

Reputation: 39

I tried the regex solution, but there's a problem with the closing tags, as they do not accept attributes!

<ns namespace="node">text</ns>

(Above all, the regex didn't look for closing tags.) So in the end I did some UGLY stuff like

$output = preg_replace('/<(\/?)(\w+):(\w+)/', '<\1\2thistaghasanamespace\3' , $output);

and

$output = preg_replace('/<(\/?)(\w+)thistaghasanamespace(\w+)/', '<\1\2:\3' , $output);

Upvotes: -1

lupos

Reputation: 374

Since this was never "solved", I decided to go ahead and implement Syndace's solution for anyone else who doesn't like figuring out regular expressions.

// do this before you use loadHTML()
// store any name spaced elements so we can re-add them later
// (only opening tags are matched; closing tags like </fb:like> are left as they are)
$postContent = preg_replace('/<(\w+):(\w+)/', '<\1 data-namespace="\2"' , $postContent);

// once you are done using domdocument fix things up
// re-construct any name-spaced tags
$postContent = preg_replace('/<(\w+) data-namespace="(\w+)"/', '<\1:\2 ' , $postContent);

Upvotes: 1

Syndace

Reputation: 96

I was having the same issue and I came up with following solutions/workarounds:

There is no clean way to parse HTML with namespaces using DOMDocument without losing the namespaces, but there are some workarounds:

Use another parser that accepts namespaces in HTML code. Look here for a nice and detailed list of HTML parsers. This is probably the most efficient way to do it.
If you want to stick with DOMDocument you basically have to pre- and postprocess the code.
- Before you send the code to DOMDocument->loadHTML, use regex, loops or whatever you want to find all namespaced tags and add a custom attribute to the opening tags containing the namespace.
```
<fb:like send="true" width="450" show_faces="true"></fb:like>
```
  would then result in
```
<fb:like xmlNamespace="fb" send="true" width="450" show_faces="true"></fb:like>
```
- Now give the edited code to DOMDocument->loadHTML. It will strip out the namespaces but it will keep the attributes resulting in
```
<like xmlNamespace="fb" send="true" width="450" show_faces="true"></like>
```
- Now (again using regex, loops or whatever you want) find all tags with the attribute xmlNamespace and replace the attribute with the actual namespace. Don't forget to also add the namespace to the closing tags!
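For the counting use case from the question, the steps above could be sketched roughly like this. The function name `countNamespacedTags` and the `data-ns` / `data-tag` marker attributes are made up for illustration. The sketch selects elements by the marker attributes rather than by tag name, so it should not matter whether the parser keeps or strips the prefix. It assumes non-empty input and that the prefix and tag name contain only word characters.

```php
<?php
// Hypothetical helper: count elements like <fb:like> by tagging them before parsing.
function countNamespacedTags(string $html, string $prefix, string $localName): int
{
    // Step 1: add marker attributes to every namespaced opening tag, e.g.
    // <fb:like send="true"> becomes <fb:like data-ns="fb" data-tag="like" send="true">.
    $marked = preg_replace('/<(\w+):(\w+)/', '<$1:$2 data-ns="$1" data-tag="$2"', $html);

    // Step 2: parse; libxml warnings about the unknown tag are collected, not printed.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($marked);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Step 3: select by the marker attributes instead of by tag name.
    // Assumes $prefix and $localName contain no quote characters.
    $xpath = new DOMXPath($dom);
    $query = sprintf('//*[@data-ns="%s"][@data-tag="%s"]', $prefix, $localName);
    return $xpath->query($query)->length;
}
```

Calling `countNamespacedTags($html, 'fb', 'like')` would then give the number of Like buttons. Writing the prefix back into the saved HTML, as described in the last step, still needs a separate post-processing pass.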

I don't think OP is still looking for an answer; I'm just posting this for anybody who finds this post in their research.

Upvotes: 0

goat

Reputation: 31813

You could use tidy to spruce things up before using an xml parser on it.

$tidy = new tidy();
$config = array(
    'output-xml'   => true, 
    'input-xml'    => true, 
    'add-xml-decl' => true,
);
$tidy->ParseString($htmlSoup, $config);
$tidy->cleanRepair();
echo $tidy;

Upvotes: 4

Explosion Pills

Reputation: 191749

Haven't been able to find a way to do it with DOM. I'm surprised the regex is slower than DOMDocument as that's usually not the case for me. strpos should be the fastest, though:

strpos($dom, '<fb:like');

This only finds the first occurrence, but you can write a simple recursive function that changes the offset appropriately.
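A minimal sketch of that idea, using a loop instead of recursion (the name `countOccurrences` is made up for illustration):

```php
<?php
// Hypothetical helper: count every occurrence of $needle by advancing the strpos offset.
function countOccurrences(string $haystack, string $needle): int
{
    if ($needle === '') {
        return 0; // an empty needle would never advance the offset
    }
    $count = 0;
    $offset = 0;
    // strpos returns false when there is no further match, so compare strictly.
    while (($pos = strpos($haystack, $needle, $offset)) !== false) {
        $count++;
        $offset = $pos + strlen($needle); // continue after the match just found
    }
    return $count;
}
```

For a plain count, the built-in `substr_count($html, '<fb:like')` should give the same number without a custom function. Either way, a needle like `'<fb:like'` also matches longer tag names that start with the same characters.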

Upvotes: 0

Jonathan

Reputation: 585

Is this what you are looking for?

You could try SimpleHTMLDOM. You can then run something like...

$html = new simple_html_dom();
$html->load_file('fileToParse.html');
$count=0;
foreach($html->find('fb:like') as $element){
    $count+=1;
}
echo $count;

That should work.

I looked a bit further and found this. I took this from the DOMDocument documentation on PHP.net.

$dom = new DOMDocument;
$dom->loadHTMLFile('fileToParse.html'); // loadHTML() expects markup, not a path; use $dom->load('fileToParse.html') for XML
$likes = $dom->getElementsByTagName('fb:like');
$count=0;
foreach ($likes as $like) {
    $count+=1;
}

After this one I am stuck

$file=file_get_contents("other.html");
$search = '/<fb:like[^>]*>/';
$count  = preg_match_all($search , $file, $matches);
echo $count;
//Below is not needed
print_r($matches);

That, however, is regex and is quite slow. I tried:

$dom = new DOMDocument;
$dom->load("other.html");
$xpath = new DOMXPath($dom); // create the XPath object once, after loading
$rootNamespace = $dom->lookupNamespaceUri($dom->namespaceURI); 
$xpath->registerNamespace('fb', $rootNamespace); 
$elementList = $xpath->query('//fb:like');

But got the same error as you.

Upvotes: 0
