Reputation: 17

How to extract an XPath match from an HTML page with the BaseX command line

I would like to extract the element matched by the XPath //DIV[@id="ps-content"] from this web page: http://www.amazon.com/dp/1449319432 (saved as a local file).

I would like to do it in a single command line with one of the best parsers, such as BaseX or Saxon-PE.

So far, the shortest solution I (seem to) have found consists of these two lines:

java -jar tagsoup-1.2.1.jar <page.html >page.xhtml
basex -ipage.xhtml "//DIV[@id='ps-content']"

but all it returns is an empty line instead of the block of HTML code I expected.

I have two questions:

1. What is wrong with my command lines? Why don't they return the block of HTML code selected by my XPath?
2. Since BaseX has embedded TagSoup support (see http://docs.basex.org/wiki/Parsers#HTML_Parser), how can I combine my two lines into a single one?

Upvotes: 1

Answers (2)

Imbuter

Reputation: 17

I finally found the right command-line:

basex "declare option db:parser 'html'; doc('page.html')//*:div[@id='ps-content']"

Note: swapping the quote types like this does not work on my Windows 7 machine (presumably because cmd.exe does not treat single quotes as argument delimiters):

basex 'declare option db:parser "html"; doc("page.html")//*:div[@id="ps-content"]'
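One way to sidestep the quoting difference between shells is to keep the query in a file, so that the shell never has to parse the quotes inside it. The following is only a sketch: the file name extract.xq is hypothetical, it assumes the launcher is called basex and accepts a query file as its input argument, and the here-document syntax is for a POSIX shell (on Windows the same two-line file can be created in any text editor).

```shell
# Write the query to a file (hypothetical name: extract.xq).
# The quoted 'EOF' delimiter stops the shell from touching the quotes inside.
cat > extract.xq <<'EOF'
declare option db:parser 'html';
doc('page.html')//*:div[@id='ps-content']
EOF

# Run the query only if BaseX is on the PATH and the saved page exists.
# Assumption: the launcher script is named "basex" and takes a query file.
if command -v basex >/dev/null 2>&1 && [ -f page.html ]; then
  basex extract.xq
fi
```

The command that actually runs, basex extract.xq, contains no quotes at all, so it should read the same in cmd.exe and in a Unix shell.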

Upvotes: 0

Jens Erat

Reputation: 38702

There are two problems with your query:

1. Tagsoup adds namespaces

Either register the namespace (declaring it as the default element namespace seems reasonable, as you are probably only dealing with XHTML):
```
basex -ipage.xhtml "declare default element namespace 'http://www.w3.org/1999/xhtml'; //div[@id='ps-content']"
```
or use the *: wildcard as the namespace prefix of each element:
```
basex -ipage.xhtml "//*:div[@id='ps-content']"
```
2. XML/XQuery is case sensitive

I already corrected it in my queries in (1): <div/> is not the same as <DIV/>. Both queries in (1) already yield the expected result.
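Both points can be checked by looking at the converted file before running any XPath against it. The sketch below is only illustrative: sample.xhtml is a small hand-written stand-in for TagSoup's output (hypothetical, not the real Amazon page), and the grep patterns are a rough diagnostic rather than a real parser.

```shell
# Hypothetical stand-in for the TagSoup output, so the check runs anywhere.
cat > sample.xhtml <<'EOF'
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<div id="ps-content">Hello</div>
</body>
</html>
EOF

# First default-namespace declaration found in the file (empty if none).
ns=$(grep -o 'xmlns="[^"]*"' sample.xhtml | head -n 1)

# Number of lines containing a lowercase or an uppercase div start tag.
lower=$(grep -c '<div' sample.xhtml || true)
upper=$(grep -c '<DIV' sample.xhtml || true)

echo "namespace: $ns"
echo "lines with <div: $lower, lines with <DIV: $upper"
```

Run against the real file by replacing sample.xhtml with page.xhtml. A non-empty namespace line together with a zero count for <DIV would be consistent with the empty result of the original query.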

Tagsoup can be used from within BaseX, so you do not have to call it separately for HTML input. Make sure Tagsoup is on your default Java classpath, e.g. by installing libtagsoup-java on Debian.

basex 'declare option db:parser "html"; doc("page.html")//*:div[@id="ps-content"]'

You can even query the HTML page directly from BaseX if you want to:

basex 'declare option db:parser "html"; doc("http://www.amazon.com/dp/1449319432")//*:div[@id="ps-content"]'

Using -i did not work for me together with Tagsoup, but you can use doc(...) instead.

Upvotes: 1
