Lion

Reputation: 17879

Parse XML from a not-well-formed page using XPath

Notice: While writing this question, I noticed that there is a GitHub API that solves my problem without any HTML parsing: https://api.github.com/repos/mozilla/geckodriver/releases/latest I decided to ask anyway, since I'm interested in how to solve the underlying problem of parsing malformed HTML itself. So please don't downvote just because a GitHub API exists for this case! GitHub could be replaced by any other page that throws validation errors.

I want to download the latest version of geckodriver. Fetching the redirection target of the latest tag lands me on the releases page:

curl $(curl -s "https://github.com/mozilla/geckodriver/releases/latest" --head | grep -i location | awk '{print $2}' | sed 's/\r//g') > /tmp/geckodriver.html

The first asset matching geckodriver-vx.xxx-linux64.tar.gz is the required link. Since XML is strictly structured, it should parse cleanly, and tools like xmllint can query it using XPath expressions. Since XPath is new to me, I tried a simple query on the header, but xmllint throws a lot of errors:

$ xmllint --xpath '//div[@class=Header]' /tmp/geckodriver.html
/tmp/geckodriver.html:51: parser error : Specification mandate value for attribute data-pjax-transient
  <meta name="selected-link" value="repo_releases" data-pjax-transient>
                                                                      ^
/tmp/geckodriver.html:107: parser error : Opening and ending tag mismatch: link line 105 and head
  </head>
         ^
/tmp/geckodriver.html:145: parser error : Entity 'nbsp' not defined
                Sign&nbsp;up
                          ^
/tmp/geckodriver.html:172: parser error : Entity 'rarr' not defined
es <span class="Bump-link-symbol float-right text-normal text-gray-light">&rarr;
...

There are a lot more. It seems that the GitHub page is not well formed, as the specification requires. I also tried xmlstarlet:

xmlstarlet sel -t -v -m '//div[@class=Header]' /tmp/geckodriver.html

but the result is similar.

Is it not possible to extract data with those tools when the HTML is not well formed?

Upvotes: 3

Views: 375

Answers (1)

Joe

Reputation: 31057

curl $(curl -s "https://github.com/mozilla/geckodriver/releases/latest" --head | grep -i location | awk '{print $2}' | sed 's/\r//g') > /tmp/geckodriver.html

It may be simpler to use -L, and have curl follow the redirection:

curl -L https://github.com/mozilla/geckodriver/releases/latest

Then, xmllint accepts an --html flag, which makes it use its forgiving HTML parser instead of the strict XML parser:

xmllint --html --xpath '//div[@class="Header"]' /tmp/geckodriver.html
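To illustrate (a minimal sketch, assuming xmllint from libxml2 is installed): the HTML parser accepts exactly the constructs the XML parser choked on in your output — valueless attributes like data-pjax-transient and undeclared entities like &nbsp;. The snippet below is a made-up stand-in for the real page:

```shell
# Write a deliberately not-well-formed HTML sample, mimicking the
# constructs that produced the parser errors in the question.
cat > /tmp/sample.html <<'EOF'
<html>
  <head><meta name="selected-link" value="repo_releases" data-pjax-transient></head>
  <body><div class="Header">Sign&nbsp;up</div></body>
</html>
EOF
# The HTML parser tolerates both constructs and the XPath query succeeds;
# 2>/dev/null hides the non-fatal warnings xmllint may still print.
xmllint --html --xpath '//div[@class="Header"]' /tmp/sample.html 2>/dev/null
```

Note the quotes around "Header": in XPath, an unquoted Header would be read as a path to a child element named Header, not as the string literal you want to compare against.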

However, this doesn't match anything on that page, so perhaps you want to base your XPath on something like:

'string((//a[span[contains(.,"linux")]])[1]/@href)'

Which yields:

/mozilla/geckodriver/releases/download/v0.26.0/geckodriver-v0.26.0-linux32.tar.gz
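Putting the pieces together (a sketch against a saved copy of the page — the anchors below are a stand-in mimicking the release-asset links, and tightening the predicate to "linux64" picks the asset the question actually wants):

```shell
# Stand-in for /tmp/geckodriver.html as fetched with curl -L above;
# the real page lists one <a><span>...</span></a> per release asset.
cat > /tmp/geckodriver.html <<'EOF'
<html><body>
  <a href="/mozilla/geckodriver/releases/download/v0.26.0/geckodriver-v0.26.0-linux32.tar.gz"><span>linux32</span></a>
  <a href="/mozilla/geckodriver/releases/download/v0.26.0/geckodriver-v0.26.0-linux64.tar.gz"><span>linux64</span></a>
</body></html>
EOF
# Extract the first linux64 asset href; string() returns it as plain text.
href=$(xmllint --html --xpath 'string((//a[span[contains(.,"linux64")]])[1]/@href)' /tmp/geckodriver.html 2>/dev/null)
# The href is relative, so prefix the host before downloading.
echo "https://github.com$href"
```

The same two commands should work unchanged on the live page saved by curl -L.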

Upvotes: 1
