MyICQ

Reputation: 1158

Extract both href and text on same line using Xidel, specific links only

I am trying to extract the link (href) and the text inside the <a> tag for a number of links in an HTML page.

I only want specific links, which I match by a substring.

Example of my HTML:

<a href="/this/dir/1234/">This should be 1234</a> some other html
<a href="/this/dir/1236/">This should be 1236</a> some other html
<a href="/about_us/">Not important link</a> some other html

I am using Xidel, which lets me avoid regular expressions and seems to be the simplest tool for the job.

What I have so far:

xidel -e "//a/(@href[contains(.,'/this/dir')],text())"

It basically works, but two issues remain:

What is the recommended way to get output like this?

/this/dir/1234  ; This should be 1234
/this/dir/1236  ; This should be 1236

Appreciate any feedback / tips.

edit:

The solution provided by Martin was 99% there. The newlines did not appear in the output, so I join the results with a dummy separator (XXX) and use awk to replace it with newlines.

Note: I am on Windows.

xidel myhtml.htm -e "string-join(//a[contains(@href, '/this/dir')]!(@href || ' ; ' || .), 'XXX')" | awk -F "XXX" "{$1=$1}1" "OFS=\n" 
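For readers without awk, the placeholder step can be sketched in Python. This only illustrates what the awk part does (split on the separator, re-join with newlines) and is not part of the original command. The function name placeholder_to_lines and the sample string are assumptions made up for illustration.

```python
def placeholder_to_lines(joined, placeholder="XXX"):
    """Split on the placeholder and re-join the pieces with newlines."""
    return "\n".join(joined.split(placeholder))


# Hypothetical sample shaped like the single line the xidel command emits.
joined = "/this/dir/1234/ ; This should be 1234XXX/this/dir/1236/ ; This should be 1236"
print(placeholder_to_lines(joined))
```

Like the awk version, this assumes the placeholder never occurs in a link or its text.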

Upvotes: 1

Views: 1038

Answers (1)

Martin Honnen

Reputation: 167676

You can move the condition into a predicate, e.g. //a[contains(@href, '/this/dir')]!(@href, string()). As for the result format, what happens if you delegate it all to XQuery with

string-join(//a[contains(@href, '/this/dir')]!(@href || ' ; ' || .), '&#10;')
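To make the logic of that expression concrete, here is a rough Python sketch using only the standard library: keep the <a> elements whose href contains the substring, build "href ; text" for each, and join the results with newlines. It is not Xidel and does not evaluate XPath. The names LinkCollector and extract_links are hypothetical, and the sample input is the HTML from the question.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect "href ; text" rows for <a> tags whose href contains a substring."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.rows = []      # finished "href ; text" lines
        self._href = None   # href of the matching <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if self.needle in href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.rows.append(self._href + " ; " + "".join(self._text))
            self._href = None


def extract_links(html, needle):
    """Return one "href ; text" line per matching link, joined by newlines."""
    parser = LinkCollector(needle)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.rows)


sample = (
    '<a href="/this/dir/1234/">This should be 1234</a> some other html\n'
    '<a href="/this/dir/1236/">This should be 1236</a> some other html\n'
    '<a href="/about_us/">Not important link</a> some other html\n'
)
print(extract_links(sample, "/this/dir"))
```

On the sample, this prints the two /this/dir links with their text and skips the /about_us/ link. The href is printed exactly as written in the source, including the trailing slash.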

Upvotes: 2
