Reputation: 110
I have an index HTML file with a file/directory listing. It is just a usual file browser page, like:
...content here...
<td><a href="20130011/">20120011/</a></td>
<td><a href="20130111/">20120111/</a></td>
<td><a href="20130211/">20120211/</a></td>
<td><a href="20130411/">20120411/</a></td>
...content here...
I can't figure out how to extract the 2nd matching line from the bottom.
1) I downloaded the HTML with curl:
content=$(curl -sL "http://path-to-html")
2) then used:
dir=$(echo $content | sed '/.*href="\([0-9]*\/\)".*/!d;s//\1/;q')
which gives me the last match: 20120411.
But how do I get the previous one?
I don't know the total count of items in advance.
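The only workaround I can think of is to collect every match first and then count from the end, roughly like this (a sketch, assuming grep supports -o):
dir=$(echo "$content" | grep -o 'href="[0-9]*/"' | tail -n 2 | head -n 1 | sed 's/^href="//;s/"$//')
Is there a cleaner way to do this?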
Upvotes: 0
Views: 135
Reputation: 10039
dir=$(echo "$content" | sed -n '/href="\([0-9]\{1,\}\/\)"/ {s|.*href="\([0-9]\{1,\}/\)".*|-\1-|;H;}
$ {x;s|.*-\([0-9]\{1,\}/\)-\(\n-[0-9]\{1,\}/-\)\{1\}$|\1|p;}')
The 1 in \{1\}$ specifies how many lines are removed from the end.
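For example, with the sample listing from the question still in $content, changing \{1\} to \{2\} should print the third entry from the bottom instead (a quick sketch, assuming GNU sed):
dir=$(echo "$content" | sed -n '/href="\([0-9]\{1,\}\/\)"/ {s|.*href="\([0-9]\{1,\}/\)".*|-\1-|;H;}
$ {x;s|.*-\([0-9]\{1,\}/\)-\(\n-[0-9]\{1,\}/-\)\{2\}$|\1|p;}')
which would be 20130111/ for the listing shown above.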
Upvotes: 0
Reputation: 141810
This awk program will print the penultimate line:
echo "${content}" | awk '{ pen = ult; ult = $0 } END { print pen }'
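A quick sanity check on trivial input (not from the original data):
printf '%s\n' one two three four | awk '{ pen = ult; ult = $0 } END { print pen }'
which prints three.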
This will print the penultimate matching line:
echo "${content}" | awk '/href="([0-9]{8}\/)"/ { pen = ult; ult = $0 } END { print pen }'
If you just want to extract the first capture group:
echo "${content}" | awk 'match($0, /href="([0-9]{8}\/)"/, a) { pen = ult; ult = a[1] } END { print pen }'
Putting it all together:
bash-4.2$ dir=$(curl -sL http://www.arteetmarte.no/tmp/index.html |
awk 'match($0, /href="([0-9]{8}\/)"/, a) {
pen = ult
ult = a[1]
}
END {
print pen
}
')
bash-4.2$ echo ${dir}
20130918/
Tested with: GNU Awk 4.1.0, API: 1.0
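If gawk's three-argument match() is not available, a rough portable variant (a sketch, assuming an awk whose regexes support the {8} interval; same example URL as above) can use RSTART/RLENGTH instead:
dir=$(curl -sL http://www.arteetmarte.no/tmp/index.html |
      awk 'match($0, /href="[0-9]{8}\//) {
               pen = ult
               ult = substr($0, RSTART + 6, RLENGTH - 6)   # drop the leading href="
           }
           END { print pen }')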
Upvotes: 3