Reputation: 1158
I am trying to extract the link (href) and the text inside the <a> tag
for a number of links in an HTML page.
I only want specific links, which I match by a substring.
Example of my html:
<a href="/this/dir/1234/">This should be 1234</a> some other html
<a href="/this/dir/1236/">This should be 1236</a> some other html
<a href="/about_us/">Not important link</a> some other html
I am using Xidel, which lets me avoid regular expressions. It seems to be the simplest tool for the job.
What I have so far:
xidel -e "//a/(@href[contains(.,'/this/dir')],text())"
It basically works, but two issues remain:
What is the recommended way to get output like this:
/this/dir/1234 ; This should be 1234
/this/dir/1236 ; This should be 1236
I would appreciate any feedback or tips.
Edit:
The solution provided by Martin was 99% there. Newlines were not output, so I join the results with a dummy separator (XXX) and use awk to replace that separator with newlines.
Note: I am on Windows.
xidel myhtml.htm -e "string-join(//a[contains(@href, '/this/dir')]!(@href || ' ; ' || .), 'XXX')" | awk -F "XXX" "{$1=$1}1" "OFS=\n"
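The awk step in the command above can be tried in isolation. This is only a minimal sketch, assuming a POSIX shell (hence single quotes instead of the Windows double quotes used above). The variable `joined` and the function `split_on_marker` are hypothetical names, and the sample line merely stands in for what Xidel would print when the results are joined with XXX.

```shell
# Stand-in for the single line Xidel emits (assumption: results joined with XXX).
joined='/this/dir/1234/ ; This should be 1234XXX/this/dir/1236/ ; This should be 1236'

# Hypothetical helper: split each input line on the XXX marker.
# -F 'XXX' sets the input field separator, OFS is the output separator.
# Assigning $1 = $1 forces awk to rebuild the record using OFS,
# and the trailing pattern 1 prints the rebuilt record.
split_on_marker() {
  awk -F 'XXX' -v OFS='\n' '{ $1 = $1 } 1'
}

printf '%s\n' "$joined" | split_on_marker
```

Without the `$1 = $1` assignment awk would print the line unchanged, because OFS is applied only when a record is rebuilt.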
Upvotes: 1
Views: 1038
Reputation: 167676
You can move the condition into a predicate, e.g. //a[contains(@href, '/this/dir')]!(@href, string()).
As for the result format, what happens if you delegate it all to XQuery with
string-join(//a[contains(@href, '/this/dir')]!(@href || ' ; ' || .), ' ')
Upvotes: 2