Reputation: 2406
Is it possible to Wget a page's title from the command line?
input:
$ wget http://bit.ly/rQyhG5 <<code>>
output:
If it’s broke, fix it right - Keeping it Real Estate. Home
Upvotes: 2
Views: 7609
Reputation: 11
The following will pull whatever lynx thinks the title of the page is, saving you from all of the regex nonsense. Assuming the page you are retrieving is standards compliant enough for lynx, this should not break.
lynx -dump example.com | sed '2q;d'
Upvotes: 0
Reputation: 16748
This script would give you what you need:
wget --quiet -O - http://bit.ly/rQyhG5 \
| sed -n -e 's!.*<title>\(.*\)</title>.*!\1!p'
But there are lots of situations where it breaks, including if there is a <title>...</title> in the body of the page, or if the title is on more than one line.
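The multi-line failure can be reproduced without touching the network by feeding hand-written HTML to the same sed expression. This is only a sketch: the function name title_one_line and the sample pages are made up for the demonstration.

```shell
# Same sed expression as above, wrapped in a function (hypothetical name).
title_one_line() {
  sed -n -e 's!.*<title>\(.*\)</title>.*!\1!p'
}

# Title on a single line: extracted as expected.
single=$(printf '<html><head><title>Hello</title></head></html>\n' | title_one_line)

# Title split over two lines: no single line holds both tags, so nothing matches.
multi=$(printf '<head><title>Hello\nWorld</title></head>\n' | title_one_line)

printf 'single: [%s]\n' "$single"
printf 'multi:  [%s]\n' "$multi"
```

sed works line by line, so the second case prints an empty result rather than an error.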
This might be a little better:
wget --quiet -O - http://bit.ly/rQyhG5 \
| paste -s -d " " \
| sed -e 's!.*<head>\(.*\)</head>.*!\1!' \
| sed -e 's!.*<title>\(.*\)</title>.*!\1!'
but it does not fit your case as your page contains the following head opening:
<head profile="http://gmpg.org/xfn/11">
Again, this might be better:
wget --quiet -O - http://bit.ly/rQyhG5 \
| paste -s -d " " \
| sed -e 's!.*<head[^>]*>\(.*\)</head>.*!\1!' \
| sed -e 's!.*<title>\(.*\)</title>.*!\1!'
but there are still ways to break it, including no head/title in the page.
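To see what the [^>]* change buys, here is a small offline comparison. It is a sketch on a hand-written page (the variable names and the "Real"/"Fake" titles are made up), with a stray <title> placed in the body on purpose.

```shell
# Hand-written sample: a head with an attribute, plus a stray <title> in the body.
page='<html><head profile="http://gmpg.org/xfn/11"><title>Real</title></head><body><title>Fake</title></body></html>'

# Literal <head>: the first sed matches nothing and passes the whole page through,
# so the greedy .*<title> in the second sed lands on the last title, the one in the body.
strict=$(printf '%s\n' "$page" \
  | paste -s -d " " \
  | sed -e 's!.*<head>\(.*\)</head>.*!\1!' \
  | sed -e 's!.*<title>\(.*\)</title>.*!\1!')

# <head[^>]*>: the head is isolated first, so only its own title is left to match.
loose=$(printf '%s\n' "$page" \
  | paste -s -d " " \
  | sed -e 's!.*<head[^>]*>\(.*\)</head>.*!\1!' \
  | sed -e 's!.*<title>\(.*\)</title>.*!\1!')

printf 'literal <head>: %s\n' "$strict"
printf '<head[^>]*>:    %s\n' "$loose"
```

Note that the literal version does not fail loudly: it silently skips the head restriction and returns the wrong title.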
Again, a better solution might be:
wget --quiet -O - http://bit.ly/rQyhG5 \
| paste -s -d " " \
| sed -n -e 's!.*<head[^>]*>\(.*\)</head>.*!\1!p' \
| sed -n -e 's!.*<title>\(.*\)</title>.*!\1!p'
but I am sure we can find a way to break it. This is why a true XML parser is the right solution, but as your question is tagged shell, the above is the best I can come up with.
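The effect of switching to -n with the p flag is easiest to see on a page that has no head at all. A rough offline check, using a made-up one-line page:

```shell
# Hand-written sample with neither <head> nor <title>.
page='<html><body>no head here</body></html>'

# Without -n: a substitution that does not match leaves the line alone,
# and sed prints it anyway, so the whole page comes out as the "title".
without_n=$(printf '%s\n' "$page" \
  | paste -s -d " " \
  | sed -e 's!.*<head[^>]*>\(.*\)</head>.*!\1!' \
  | sed -e 's!.*<title>\(.*\)</title>.*!\1!')

# With -n and the p flag: a line is printed only when the substitution succeeded.
with_n=$(printf '%s\n' "$page" \
  | paste -s -d " " \
  | sed -n -e 's!.*<head[^>]*>\(.*\)</head>.*!\1!p' \
  | sed -n -e 's!.*<title>\(.*\)</title>.*!\1!p')

printf 'without -n: [%s]\n' "$without_n"
printf 'with -n:    [%s]\n' "$with_n"
```

An empty result is much easier for a calling script to detect than a page-sized string.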
The paste and the two sed commands can be merged into a single sed, but the result is less readable. However, this version has the advantage of working on multi-line titles:
wget --quiet -O - http://bit.ly/rQyhG5 \
| sed -n -e 'H;${x;s!.*<head[^>]*>\(.*\)</head>.*!\1!;T;s!.*<title>\(.*\)</title>.*!\1!p}'
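The H;${x;...} part is what replaces paste: H appends every line to the hold space, and on the last line x swaps the accumulated text into the pattern space, where the substitutions see the whole input at once. A tiny sketch of just that idiom, on three made-up lines:

```shell
# Collect all input lines in the hold space, then work on them in one go.
slurped=$(printf 'a\nb\nc\n' | sed -n 'H;${x;s/\n/ /g;p;}')

# Brackets make the leading space visible.
printf '[%s]\n' "$slurped"
```

H puts a newline in front of each line it appends, including the first, so the collected text starts with a newline (a space after the replacement here). That is harmless in the title extraction because the pattern begins with .* and swallows it.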
Update:
As explained in the comments, the last sed above uses the T command, which is a GNU extension. If you do not have a compatible version, you can use:
wget --quiet -O - http://bit.ly/rQyhG5 \
| sed -n -e 'H;${x;s!.*<head[^>]*>\(.*\)</head>.*!\1!;tnext;b;:next;s!.*<title>\(.*\)</title>.*!\1!p}'
Update 2:
As the above still does not work on Mac, try:
wget --quiet -O - http://bit.ly/rQyhG5 \
| sed -n -e 'H;${x;s!.*<head[^>]*>\(.*\)</head>.*!\1!;tnext};b;:next;s!.*<title>\(.*\)</title>.*!\1!p'
and/or
cat << EOF > script
H
\$x
\$s!.*<head[^>]*>\(.*\)</head>.*!\1!
\$tnext
b
:next
s!.*<title>\(.*\)</title>.*!\1!p
EOF
wget --quiet -O - http://bit.ly/rQyhG5 \
| sed -n -f script
(Note the \ before each $, which stops the shell from expanding it as a variable inside the here-document.)
It seems that the :next label does not like to be prefixed by a $, which could be a problem in some sed versions.
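For what it is worth, the script-file version can be tried offline too. The sketch below makes two assumptions that are not in the answer above: it writes the script to a temporary directory (the names tmpdir and title.sed are made up), and it quotes the here-document delimiter so that no \ is needed before the $.

```shell
# Work in a throwaway directory so nothing in the current one is overwritten.
tmpdir=$(mktemp -d)
script="$tmpdir/title.sed"

# Quoted delimiter ('EOF'): the shell leaves $ and \ untouched inside the body.
cat << 'EOF' > "$script"
H
$x
$s!.*<head[^>]*>\(.*\)</head>.*!\1!
$tnext
b
:next
s!.*<title>\(.*\)</title>.*!\1!p
EOF

# Hand-written sample page with the title split over two lines.
title=$(printf '<html>\n<head profile="x">\n<title>Hello\nWorld</title>\n</head>\n<body></body>\n</html>\n' \
  | sed -n -f "$script")

printf '%s\n' "$title"
rm -r "$tmpdir"
```

The title comes back with its embedded line break intact; pipe the result through tr '\n' ' ' if a single line is wanted.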
Upvotes: 12