Reputation: 8295
I am trying to parse a fairly simple web page for information in a shell script. The web page I'm working with now is generated here. For example, I would like to pull the information on the Internet service provider into a shell variable. It may make sense to use one of the programs xmllint, XMLStarlet, or xpath for this purpose. I am quite familiar with shell scripting, but I am new to XPath and the utilities that implement it, so I would appreciate a few pointers in the right direction.
Here's the beginnings of the shell script:
HTMLISPInformation="$(curl --user-agent "Mozilla/5.0" http://aruljohn.com/details.php)"
# ISP="$(<XPath magic goes here.>)"
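One way to fill in that XPath magic is with xmllint. A minimal sketch, assuming the page lays out its data as label/value table cells; the inline snippet and the name "Example ISP Inc." below are made-up stand-ins for the real page:

```shell
# Stand-in for the HTML the real page would return (assumption: the ISP
# sits in the <td> immediately after a "Internet Provider" label cell).
html='<table><tr><td>Internet Provider</td><td>Example ISP Inc.</td></tr></table>'

# --html tolerates tag soup; anchoring the XPath on the label cell's text
# (rather than on a row position) survives rows being added or reordered.
ISP="$(printf '%s' "$html" | xmllint --html --xpath \
  '//td[text()="Internet Provider"]/following-sibling::td[1]/text()' - 2>/dev/null)"

echo "$ISP"
```

With the real page you would pipe the curl output into xmllint instead of the inline snippet.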
For your convenience, here is a utility for dynamically testing XPath syntax online:
Upvotes: 12
Views: 10195
Reputation: 166319
xpup
A command-line XML parsing tool written in Go. For example:
$ curl -sL https://www.w3schools.com/xml/note.xml | xpup '/*/body'
Don't forget me this weekend!
or:
$ xpup '/note/from' < <(curl -sL https://www.w3schools.com/xml/note.xml)
Jani
Here is an example of parsing an HTML page:
$ xpup '/*/head/title' < <(curl -sL https://example.com/)
Example Domain
pup
For HTML parsing, try pup. For example:
$ pup 'title text{}' -f <(curl -sL https://example.com/)
Example Domain
See related Feature Request for XPath.
Install by: go get github.com/ericchiang/pup
Upvotes: 3
Reputation: 16907
You could use my Xidel. Extracting values from HTML pages on the command line is its main purpose. Although it is not a standard tool, it is a single, dependency-free binary that can be installed and run without root privileges.
It can read the value directly from the webpage without involving other programs.
With XPath:
xidel http://aruljohn.com/details.php -e '//td[text()="Internet Provider"]/following-sibling::td'
Or with pattern-matching:
xidel http://aruljohn.com/details.php -e '<td>Internet Provider</td><td>{.}</td>' --hide-variable-names
Upvotes: 5
Reputation: 455
Quick and dirty solution...
xmllint --html --xpath "//table/tbody/tr[6]/td[2]" page.html
You can find the XPath of a node with Chrome's Developer Tools: inspect the node, right-click it, and select Copy XPath.
I wouldn't rely on this too much; positional paths like tr[6] break as soon as the page layout changes.
All the information on that page can be found elsewhere anyway: run whois on your own IP address, for instance...
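In the same quick-and-dirty spirit, if no XPath tool is installed at all, a sed one-liner can scrape rigid markup. A minimal sketch, assuming the label and value cells land on a single line; the snippet and the ISP name are invented:

```shell
# Stand-in for one row of the real page's table (assumption: label and
# value share a line, as they do in compact generated HTML).
html='<tr><td>Internet Provider</td><td>Example ISP Inc.</td></tr>'

# Strip everything up to and including the label's closing tags, then keep
# the text of the next cell. Brittle, like any regex-on-HTML approach.
ISP="$(printf '%s\n' "$html" | sed -n 's/.*Internet Provider<\/td><td>\([^<]*\).*/\1/p')"

echo "$ISP"
```

This breaks as soon as the markup gains attributes or whitespace between the cells, which is exactly why the XPath-based tools above are preferable.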
Upvotes: 10
Reputation: 166319
There are many command-line tools in the HTML-XML-utils package that can parse HTML files (e.g. hxselect to match a CSS selector).
There is also xpath, a command-line wrapper around Perl's XPath library (XML::XPath).
Related: Command line tool to query HTML elements at SU
Upvotes: 1