d3pd

Reputation: 8295

Parsing for data in HTML using XPath (in a shell script)

I am trying to parse a fairly simple web page for information in a shell script. The web page I'm working with now is generated here. For example, I would like to pull the information on the internet service provider into a shell variable. It may make sense to use one of the programs xmllint, XMLStarlet or xpath for this purpose. I am quite familiar with shell scripting, but I am new to XPath syntax and the utilities that implement it, so I would appreciate a few pointers in the right direction.

Here's the beginnings of the shell script:

HTMLISPInformation="$(curl --user-agent "Mozilla/5.0" http://aruljohn.com/details.php)"
# ISP="$(<XPath magic goes here.>)"

For your convenience, here is a utility for dynamically testing XPath syntax online:

http://www.bit-101.com/xpath/

Upvotes: 12

Views: 10195

Answers (5)

kenorb

Reputation: 166319

xpup

XML

A command-line XML parsing tool written in Go. For example:

$ curl -sL https://www.w3schools.com/xml/note.xml | xpup '/*/body'
Don't forget me this weekend!

or:

$ xpup '/note/from' < <(curl -sL https://www.w3schools.com/xml/note.xml)
Jani

HTML

Here is an example of parsing an HTML page:

$ xpup '/*/head/title' < <(curl -sL https://example.com/)
Example Domain

pup

For HTML parsing, try pup. For example:

$ pup 'title text{}' -f <(curl -sL https://example.com/)
Example Domain

See related Feature Request for XPath.

Installation

Install pup with: go get github.com/ericchiang/pup.
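
Once installed, pup can be pointed at the page from the question. A sketch for capturing the ISP into a shell variable, assuming pup supports the :contains() pseudo-class and the + sibling combinator, and that the ISP name sits in the <td> immediately after the <td> containing "Internet Provider":

# Grab the page and print the text of the cell after the "Internet Provider" label.
ISP="$(curl -s --user-agent "Mozilla/5.0" http://aruljohn.com/details.php \
  | pup 'td:contains("Internet Provider") + td text{}')"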

Upvotes: 3

BeniBela

Reputation: 16907

You could use my Xidel. Extracting values from HTML pages on the command line is its main purpose. Although it is not a standard tool, it is a single, dependency-free binary that can be installed and run without being root.

It can directly read the value from the webpage without involving other programs.

With XPath:

xidel http://aruljohn.com/details.php -e '//td[text()="Internet Provider"]/following-sibling::td'

Or with pattern-matching:

xidel http://aruljohn.com/details.php -e '<td>Internet Provider</td><td>{.}</td>' --hide-variable-names
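
Since the goal is a shell variable, the output can be captured directly; a minimal sketch (--silent suppresses xidel's status output):

# Capture the extracted value straight into a variable.
ISP="$(xidel --silent http://aruljohn.com/details.php -e '//td[text()="Internet Provider"]/following-sibling::td')"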

Upvotes: 5

Michel Guillet

Reputation: 455

Quick and dirty solution...

xmllint --html --xpath "//table/tbody/tr[6]/td[2]" page.html

You can find the XPath of your node using Chrome's Developer Tools: when inspecting the node, right-click it and select Copy XPath.

I wouldn't rely on this too much, though: positional XPaths like this break as soon as the page layout changes, and Chrome's Copy XPath reflects the browser's DOM (including inserted <tbody> elements) rather than the raw HTML.

All the information on your page can be found elsewhere: run whois on your own IP for instance...
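
To fold this into the asker's script without a temporary file, xmllint can read the page from standard input. A sketch reusing the positional XPath above (the stderr redirect hides the parser warnings that real-world HTML usually triggers, and the expression only matches if the raw HTML actually contains the <tbody> that Chrome shows):

# "-" tells xmllint to read the document from stdin.
ISP="$(curl -s --user-agent "Mozilla/5.0" http://aruljohn.com/details.php \
  | xmllint --html --xpath '//table/tbody/tr[6]/td[2]/text()' - 2>/dev/null)"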

Upvotes: 10

kenorb

Reputation: 166319

HTML-XML-utils

There are many command-line tools in the HTML-XML-utils package that can parse HTML files (e.g. hxselect, which matches CSS selectors).
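
For instance, to extract a page title (a sketch: hxselect expects well-formed input, so hxnormalize -x is commonly used to clean up the HTML first, and -c prints only the element's content):

$ curl -s https://example.com/ | hxnormalize -x | hxselect -c 'title'
Example Domain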

There is also xpath, a command-line wrapper around Perl's XPath library (XML::XPath).

Related: Command line tool to query HTML elements at SU

Upvotes: 1

asgoth

Reputation: 35829

Consider using PhantomJS. It is a headless WebKit browser that lets you execute JavaScript/CoffeeScript against a web page. I think it could help you solve your issue.

pjscrape is a useful web-scraping tool built on top of PhantomJS.
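
A minimal sketch of what that could look like for the asker's page (the file name isp.js is arbitrary, and the DOM lookup assumes the same "Internet Provider" table cell as in the other answers):

# Write a small PhantomJS script that opens the page, reads the cell
# following the "Internet Provider" label, and prints it.
cat > isp.js <<'EOF'
var page = require('webpage').create();
page.open('http://aruljohn.com/details.php', function (status) {
    var isp = page.evaluate(function () {
        var tds = document.getElementsByTagName('td');
        for (var i = 0; i < tds.length - 1; i++) {
            if (tds[i].textContent === 'Internet Provider') {
                return tds[i + 1].textContent;
            }
        }
        return '';
    });
    console.log(isp);
    phantom.exit();
});
EOF
ISP="$(phantomjs isp.js)"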

Upvotes: 3
