Reputation: 17
I would like to use the XPath //DIV[@id="ps-content"] to extract a block from this web page: http://www.amazon.com/dp/1449319432 (saved as a local file)
I would like to do it with a single command line, using one of the best processors, like BaseX or Saxon-PE.
So far the shortest solution I seem to have found uses these two lines:
java -jar tagsoup-1.2.1.jar <page.html >page.xhtml
basex -ipage.xhtml "//DIV[@id='ps-content']"
but all it returns is an empty line, instead of the expected block of HTML code:
I have two questions:
Upvotes: 1
Views: 874
Reputation: 17
I finally found the right command-line:
basex "declare option db:parser 'html'; doc('page.html')//*:div[@id='ps-content']"
Note: swapping the two kinds of quotes like this does not work on my Windows 7 machine:
basex 'declare option db:parser "html"; doc("page.html")//*:div[@id="ps-content"]'
Upvotes: 0
Reputation: 38702
There are two problems with your query:
TagSoup adds namespaces
Either register the namespace (it seems reasonable to declare the default namespace as you're probably only dealing with XHTML):
basex -ipage.xhtml "declare default element namespace 'http://www.w3.org/1999/xhtml'; //div[@id='ps-content']"
or use * as the namespace wildcard for each element:
basex -ipage.xhtml "//*:div[@id='ps-content']"
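The namespace effect can be reproduced outside BaseX. Here is a minimal sketch using Python's standard ElementTree (not BaseX, and the tiny XHTML fragment is a made-up stand-in for the TagSoup output): a plain element name fails to match once the default XHTML namespace is in play, which is exactly why //div finds nothing without a namespace declaration or a *: wildcard.

```python
import xml.etree.ElementTree as ET

# Stand-in for the TagSoup output: note the default XHTML namespace,
# which TagSoup attaches to every converted element.
xhtml = '''<html xmlns="http://www.w3.org/1999/xhtml">
  <body><div id="ps-content">Product description</div></body>
</html>'''

root = ET.fromstring(xhtml)

# A bare name does not match a namespaced element...
print(root.find('.//div'))  # None

# ...but qualifying it with the namespace URI does
# (the analogue of declaring the default element namespace in XQuery).
ns_div = root.find('.//{http://www.w3.org/1999/xhtml}div')
print(ns_div.get('id'))  # ps-content
```

The BaseX *:div wildcard plays the same role as the namespace-qualified lookup above.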
XML/XQuery is case-sensitive
I already corrected this in my queries in (1): <div/> is not the same as <DIV/>. Both queries in (1) already yield the expected result.
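The case-sensitivity point can also be checked with a short Python sketch (again stdlib ElementTree with an invented two-element document, not the real page): unlike HTML, XML treats DIV and div as two different element names.

```python
import xml.etree.ElementTree as ET

# XML element names are case-sensitive, so <DIV/> and <div/> are
# distinct elements -- unlike in (forgiving) HTML.
doc = ET.fromstring('<root><DIV id="ps-content"/></root>')

print(doc.find('div'))            # None -- lowercase does not match <DIV/>
print(doc.find('DIV').get('id'))  # ps-content
```

This is why the uppercase //DIV in the original query returns nothing after TagSoup has normalized element names to lowercase.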
TagSoup can be used from within BaseX; you do not have to call it separately for HTML input. Make sure TagSoup is on your default Java classpath, e.g. by installing libtagsoup-java on Debian.
basex 'declare option db:parser "html"; doc("page.html")//*:div[@id="ps-content"]'
You can even query the HTML page directly from BaseX if you want to:
basex 'declare option db:parser "html"; doc("http://www.amazon.com/dp/1449319432")//*:div[@id="ps-content"]'
Using -i didn't work for me with TagSoup, but you can use doc(...) instead.
Upvotes: 1