Imbuter

Reputation: 17

How to extract an XPath match from an HTML page with the BaseX command line

I would like to extract the element matched by the XPath expression //DIV[@id="ps-content"] from this web page: http://www.amazon.com/dp/1449319432 (saved as a local file).

I would like to do it in a single command line with one of the best parsers, like BaseX or Saxon-PE.

So far the shortest solution that I (seem to have) found uses these two lines:

java -jar tagsoup-1.2.1.jar <page.html >page.xhtml
basex -ipage.xhtml "//DIV[@id='ps-content']"

but all it returns is an empty line instead of the block of HTML code I expected:

I have two questions:

Upvotes: 1

Views: 874

Answers (2)

Imbuter

Reputation: 17

I finally found the right command-line:

basex "declare option db:parser 'html'; doc('page.html')//*:div[@id='ps-content']"

Note: swapping the quote style like this doesn't work on my Windows 7 machine, because cmd.exe does not treat single quotes as quoting characters:

basex 'declare option db:parser "html"; doc("page.html")//*:div[@id="ps-content"]'

Upvotes: 0

Jens Erat

Reputation: 38702

There are two problems with your query:

  1. Tagsoup adds namespaces

    Either register the namespace (it seems reasonable to declare the default namespace as you're probably only dealing with XHTML):

    basex -ipage.xhtml "declare default element namespace 'http://www.w3.org/1999/xhtml'; //div[@id='ps-content']"
    

    or use the * namespace wildcard on each element:

    basex -ipage.xhtml "//*:div[@id='ps-content']"
    
  2. XML and XQuery are case-sensitive

    I already corrected it in my queries in (1): <div/> is not the same as <DIV/>. Both queries in (1) already yield the expected result.
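
To see both effects in isolation, here is a minimal sketch using Python's standard-library ElementTree instead of BaseX (so the path syntax differs slightly from XQuery). The XHTML snippet is a made-up stand-in for what TagSoup produces, not the real Amazon page, and the `{*}` wildcard assumes Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for TagSoup output: lower-case element names
# placed in the XHTML namespace.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="ps-content"><p>Product description</p></div>
  </body>
</html>"""

root = ET.fromstring(XHTML)
ns = {"x": "http://www.w3.org/1999/xhtml"}

# Upper-case name, no namespace: the situation of the original query.
no_ns_upper = root.findall(".//DIV[@id='ps-content']")

# Lower-case name, no namespace: still no match, the element is namespaced.
no_ns_lower = root.findall(".//div[@id='ps-content']")

# Namespace bound but upper-case name: no match, names are case-sensitive.
ns_upper = root.findall(".//x:DIV[@id='ps-content']", ns)

# Namespace bound and lower-case name: matches the element.
ns_lower = root.findall(".//x:div[@id='ps-content']", ns)

# Namespace wildcard, roughly what *:div does in XQuery (Python 3.8+).
any_ns_lower = root.findall(".//{*}div[@id='ps-content']")

print(len(no_ns_upper), len(no_ns_lower), len(ns_upper),
      len(ns_lower), len(any_ns_lower))
```

Only the last two lookups find the element, which mirrors why the two corrected BaseX queries work while //DIV returns nothing.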


TagSoup can be used from within BaseX, so you do not have to call it separately for HTML input. Make sure TagSoup is on your default Java classpath, e.g. by installing libtagsoup-java on Debian.

basex 'declare option db:parser "html"; doc("page.html")//*:div[@id="ps-content"]'

You can even query the HTML page directly from BaseX if you want to:

basex 'declare option db:parser "html"; doc("http://www.amazon.com/dp/1449319432")//*:div[@id="ps-content"]'

Using -i didn't work for me in combination with TagSoup, but you can use doc(...) instead.

Upvotes: 1
