lovesh

Reputation: 5401

Is it necessary to use the namespaces of an XHTML document in XPath?

I am scraping specific portions of some web pages using PHP, cURL and XPath. People have suggested that I should use the namespaces of the XHTML document for the XPath expressions to work. As far as I know, namespaces are used to avoid collisions between element names, so why do I need them in this case? I am converting the web page to XHTML using Tidy. Do I really need the namespaces, and if so, in which cases? The same code without namespaces works well for scraping content from Wikipedia. Also, even after modifying my PHP code to include namespaces, the code doesn't work for some URLs. You can have a look at this post.

Upvotes: 0

Views: 309

Answers (2)

Mads Hansen

Reputation: 66723

It is possible to use XPath expressions that do not use namespaces.

If you are scraping web content and aren't sure whether it will be XHTML or well-formed HTML that is not bound to a namespace, you may find it more convenient to use more generic match criteria in your XPath that ignore the namespace of the elements.

You can do this with a generic match for any element (e.g. *) combined with a predicate filter on the local-name() of the element (e.g. *[local-name()='table']).

Doing so will match on any element with that name, whether it is bound to a particular namespace or not.

For example:

//*[local-name()='body']/*[local-name()='table'][4]
     /*[local-name()='tbody']/*[local-name()='tr'][3]
     /*[local-name()='td'][4]
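
As a minimal sketch of the local-name() approach in PHP, assuming the page has already been turned into well-formed XML (for example by Tidy): the two input strings and the helper name cellTexts() are made up for illustration, and DOMDocument/DOMXPath are used here as one possible XPath API.

```php
<?php
// Two hypothetical inputs: one without a namespace, one in the XHTML namespace.
$plain = '<html><body><table><tr><td>plain cell</td></tr></table></body></html>';
$xhtml = '<html xmlns="http://www.w3.org/1999/xhtml"><body><table><tr><td>xhtml cell</td></tr></table></body></html>';

// Hypothetical helper: returns the text of every <td> inside a <table>,
// matching on local names only so the namespace does not matter.
function cellTexts(string $xml): array
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $xpath = new DOMXPath($doc);

    $texts = [];
    foreach ($xpath->query("//*[local-name()='table']//*[local-name()='td']") as $node) {
        $texts[] = $node->textContent;
    }
    return $texts;
}

$plainCells = cellTexts($plain);
$xhtmlCells = cellTexts($xhtml);
print_r($plainCells);
print_r($xhtmlCells);
```

The same expression finds the cell in both inputs, which is the convenience being described. The trade-off is that it would also match a table element from any other namespace.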

Upvotes: 2

jasso

Reputation: 13966

First of all: namespaces are a fundamental concept in XML. If you are not familiar with namespaces, please take time to learn and understand them.

You need to use namespace prefixes in your XPath expressions if and only if the XML document you are processing uses namespaces.

All XPath (1.0) name tests use qualified names, which means that an expression without a namespace prefix only ever matches targets in no namespace. An expression such as /element-1/element-2 is therefore always searching for elements that are not in any namespace (in other words, elements whose namespace URI is empty). The example XPath expression works on this document...

<element-1>
    <element-2>Works!</element-2>
</element-1>

...but it doesn't work on this document...

<ns:element-1 xmlns:ns="http://example.com">
    <ns:element-2>Doesn't work</ns:element-2>
</ns:element-1>

...because in this case both the <element-1> and the <element-2> belong to a namespace (with URI http://example.com). Also notice that elements might belong to a namespace, even though they don't have any namespace prefix, if the document has a default namespace. This document...

<element-1 xmlns="http://example.com">
    <element-2>Similar to previous, and doesn't work either.</element-2>
</element-1>

...is equivalent to the second document example as far as namespaces are concerned, and using XPath on it also requires namespace prefixes.
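
The behaviour of the three example documents can be checked with a short script. This is only a sketch, assuming PHP's DOM extension (DOMDocument and DOMXPath) as the XPath engine; the element text is shortened and the array labels are invented for readability.

```php
<?php
// The three example documents from above, keyed by a descriptive label.
$documents = [
    'no namespace'      => '<element-1><element-2>Works!</element-2></element-1>',
    'prefixed'          => '<ns:element-1 xmlns:ns="http://example.com"><ns:element-2>x</ns:element-2></ns:element-1>',
    'default namespace' => '<element-1 xmlns="http://example.com"><element-2>x</element-2></element-1>',
];

// Run the same unprefixed expression against each document
// and record how many nodes it selects.
$matches = [];
foreach ($documents as $label => $xml) {
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $xpath = new DOMXPath($doc);
    $matches[$label] = $xpath->query('/element-1/element-2')->length;
}

print_r($matches);
```

Only the first document yields a match. The other two give an empty node list because their elements belong to the http://example.com namespace.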

Searching data from this document would require registering the namespace URI with some prefix and then using that prefix in your XPath expressions. Something like /px:element-1/px:element-2. Do note that the prefix you register doesn't need to match the one used in the document but the URIs must match exactly as they are. Another point to notice is that even though elements in default namespace don't have a prefix, you still need to use the prefix you defined in your XPath expressions in order to match them.
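
A rough illustration of these two points, again assuming DOMXPath: the prefixes px and bad are arbitrary names chosen for this sketch, and the second registration deliberately uses a URI with a trailing slash to show that a near-match does not count.

```php
<?php
// Document whose elements are in a default namespace (no prefix in the markup).
$xml = '<element-1 xmlns="http://example.com"><element-2>found</element-2></element-1>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// "px" does not appear in the document; only the URI has to match.
$xpath->registerNamespace('px', 'http://example.com');
$hit = $xpath->query('/px:element-1/px:element-2');

// A URI that differs by a single trailing slash is a different namespace.
$xpath->registerNamespace('bad', 'http://example.com/');
$miss = $xpath->query('/bad:element-1/bad:element-2');

echo $hit->length, "\n";
echo $miss->length, "\n";
```

The first query selects the element and the second selects nothing, even though the document itself never uses a prefix.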

So the need for a namespace prefix in XPath queries depends on the document. Some web sites serve their pages as valid XHTML documents and thus all the elements belong in the XHTML namespace. Some other sites serve HTML or XHTML without a namespace, which is technically invalid XHTML.
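
For a scraper that may meet both kinds of pages, one option is to inspect the root element and pick the expression accordingly. The following is a sketch under assumptions: the helper name pageTitle() and the prefix x are invented for this example, and the input is assumed to be well-formed XML already (for instance Tidy output).

```php
<?php
const XHTML_NS = 'http://www.w3.org/1999/xhtml';

// Hypothetical helper: returns the <title> text whether or not
// the page declares the XHTML namespace on its root element.
function pageTitle(string $xml): ?string
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    $xpath = new DOMXPath($doc);

    if ($doc->documentElement->namespaceURI === XHTML_NS) {
        // Namespaced page: register a prefix and use it in every step.
        $xpath->registerNamespace('x', XHTML_NS);
        $nodes = $xpath->query('/x:html/x:head/x:title');
    } else {
        // Page without a namespace: plain, unprefixed names match.
        $nodes = $xpath->query('/html/head/title');
    }

    return $nodes->length > 0 ? $nodes->item(0)->textContent : null;
}

$withNs    = pageTitle('<html xmlns="http://www.w3.org/1999/xhtml"><head><title>A</title></head><body/></html>');
$withoutNs = pageTitle('<html><head><title>B</title></head><body/></html>');
echo $withNs, "\n", $withoutNs, "\n";
```

This keeps the expressions strict about the namespace when one is present, at the cost of maintaining two variants of each query.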

How namespace prefixes are registered depends on the XML framework or library that you use. In PHP with SimpleXML it is done roughly this way:

$your_xml_doc->registerXPathNamespace("ns", "http://example.com");
$result = $your_xml_doc->xpath('/ns:element-1/ns:element-2');

Upvotes: 3
