Reputation: 1417
I'm writing a script that takes a webpage and counts how many times elements such as a Facebook Like button are used. Since this is best done with a DOM, I decided to use PHP's DOMDocument.
The one problem I have come across, though, is with elements like Facebook's Like button:
<fb:like send="true" width="450" show_faces="true"></fb:like>
Since this element technically has a namespace prefix of "fb", DOMDocument throws a warning saying this namespace prefix is not defined. It then strips off the prefix, so when I get to that element, its tag name is no longer fb:like but just like.
Is there any way to "pre-register" a namespace? Any suggestions?
Upvotes: 7
Views: 5916
Reputation: 39
I tried the regex solution, but there's a problem with closing tags: they cannot carry attributes, so the namespace information is lost there.
<ns namespace="node">text</ns>
(On top of that, the regex didn't look for closing tags at all.) So in the end I did some UGLY stuff like
$output = preg_replace('/<(\/?)(\w+):(\w+)/', '<\1\2thistaghasanamespace\3' , $output);
and
$output = preg_replace('/<(\/?)(\w+)thistaghasanamespace(\w+)/', '<\1\2:\3' , $output);
Upvotes: -1
Reputation: 374
Since this was never "solved" I decided to go ahead and implement syndance's solution for anyone else who doesn't like figuring out regular expressions.
// Do this before you call loadHTML():
// move the part after the colon into an attribute so it can be restored later.
// Note: closing tags such as </fb:like> are not matched by this pattern.
$postContent = preg_replace('/<(\w+):(\w+)/', '<\1 data-namespace="\2"', $postContent);
// Once you are done with DOMDocument, fix things up:
// reconstruct the namespaced opening tags.
$postContent = preg_replace('/<(\w+) data-namespace="(\w+)"/', '<\1:\2', $postContent);
Upvotes: 1
Reputation: 96
I was having the same issue and I came up with following solutions/workarounds:
There is no clean way to parse HTML with namespaces using DOMDocument without losing the namespaces but there are some workarounds:
If you want to stick with DOMDocument you basically have to pre- and postprocess the code.
Before you send the code to DOMDocument->loadHTML, use regex, loops or whatever you want to find all namespaced tags and add a custom attribute to the opening tags containing the namespace.
<fb:like send="true" width="450" show_faces="true"></fb:like>
would then result in
<fb:like xmlNamespace="fb" send="true" width="450" show_faces="true"></fb:like>
Now give the edited code to DOMDocument->loadHTML. It will strip out the namespaces but it will keep the attributes resulting in
<like xmlNamespace="fb" send="true" width="450" show_faces="true"></like>
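For anyone who wants the preprocessing step spelled out, here is a minimal sketch. The function name tagNamespacedElements is made up for illustration, the attribute name xmlNamespace is just the one used in the example above, and the regex only handles simple prefix:name opening tags. Closing tags are left alone because they can't carry attributes anyway.

```php
<?php
// Hypothetical helper (not part of DOMDocument): add an xmlNamespace
// attribute to every opening tag of the form <prefix:name ...>.
function tagNamespacedElements(string $html): string
{
    // $1 = prefix, $2 = local name. Closing tags (</fb:like>) do not match
    // because "<" must be followed directly by a word character.
    return preg_replace('/<(\w+):(\w+)/', '<$1:$2 xmlNamespace="$1"', $html);
}

$input = '<fb:like send="true" width="450" show_faces="true"></fb:like>';
$tagged = tagNamespacedElements($input);
echo $tagged, "\n";
```

The result can then be passed to DOMDocument->loadHTML. I believe the HTML parser may hand attribute names back in lowercase, so it is worth checking what the attribute is actually called after parsing.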
I don't think OP is still looking for an answer, I'm just posting this for anybody that finds this post in their research.
Upvotes: 0
Reputation: 31813
You could use tidy to spruce things up before using an xml parser on it.
$tidy = new tidy();
$config = array(
    'output-xml' => true,
    'input-xml' => true,
    'add-xml-decl' => true,
);
$tidy->parseString($htmlSoup, $config);
$tidy->cleanRepair();
echo $tidy;
Upvotes: 4
Reputation: 191749
Haven't been able to find a way to do it with DOM. I'm surprised the regex is slower than DOMDocument, as that's usually not the case for me. strpos should be the fastest, though:
strpos($dom, '<fb:like');
This only finds the first occurrence, but you can write a simple recursive function that changes the offset appropriately.
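A rough sketch of the offset idea, using a loop rather than recursion. The function name countOccurrences is just something I made up, and note that a plain substring search for '<fb:like' would also match longer tag names that start the same way.

```php
<?php
// Hypothetical helper: count non-overlapping occurrences of $needle in
// $haystack by calling strpos() repeatedly with a moving offset.
function countOccurrences(string $haystack, string $needle): int
{
    if ($needle === '') {
        return 0; // an empty needle would never advance the offset
    }
    $count = 0;
    $offset = 0;
    while (($pos = strpos($haystack, $needle, $offset)) !== false) {
        $count++;
        // continue searching right after the match we just found
        $offset = $pos + strlen($needle);
    }
    return $count;
}

$html = '<fb:like></fb:like><p></p><fb:like send="true"></fb:like>';
$likes = countOccurrences($html, '<fb:like');
echo $likes, "\n";
```

PHP's built-in substr_count($html, '<fb:like') should give the same number if you don't need the positions.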
Upvotes: 0
Reputation: 585
Is this what you are looking for?
You could try SimpleHTMLDOM. You can then run something like...
$html = new simple_html_dom();
$html->load_file('fileToParse.html');
$count=0;
foreach($html->find('fb:like') as $element){
$count+=1;
}
echo $count;
That should work.
I looked a bit further and found this. I took it from the DOMDocument documentation on PHP.net.
$dom = new DOMDocument;
$dom->loadHTMLFile('fileToParse.html'); // or $dom->load('fileToParse.html') for XML
$likes = $dom->getElementsByTagName('fb:like');
$count=0;
foreach ($likes as $like) {
$count+=1;
}
After that one I got stuck, so I tried a regex instead:
$file=file_get_contents("other.html");
$search = '/<fb:like[^>]*>/';
$count = preg_match_all($search , $file, $matches);
echo $count;
//Below is not needed
print_r($matches);
That, however, is regex and is quite slow. I also tried:
$dom = new DOMDocument;
$dom->load("other.html");
$xpath = new DOMXPath($dom);
$rootNamespace = $dom->lookupNamespaceUri($dom->namespaceURI);
$xpath->registerNamespace('fb', $rootNamespace);
$elementList = $xpath->query('//fb:like');
But got the same error as you.
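For what it's worth, the XPath route does seem to work when the prefix is actually declared in the markup and the input is well-formed XML. A minimal sketch under those assumptions: the <root> wrapper and the namespace URI http://example.com/fb are placeholders I picked, not anything Facebook defines, and real-world HTML will often not be well-formed enough for loadXML.

```php
<?php
// Assumption: $fragment is well-formed XML that uses the fb: prefix.
$fragment = '<p><fb:like send="true"></fb:like><fb:like></fb:like></p>';

// Placeholder URI, chosen only so the prefix is bound to something.
$fbUri = 'http://example.com/fb';

// Wrap the fragment in a root element that declares the fb prefix.
$xml = '<root xmlns:fb="' . $fbUri . '">' . $fragment . '</root>';

$dom = new DOMDocument();
$dom->loadXML($xml);

// Bind the same prefix for XPath and count the matching elements.
$xpath = new DOMXPath($dom);
$xpath->registerNamespace('fb', $fbUri);
$count = $xpath->query('//fb:like')->length;
echo $count, "\n";
```

The URI passed to registerNamespace has to match the one declared in the document; the prefix itself is only a label on the XPath side.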
Upvotes: 0