Mr. Demetrius Michael

Reputation: 2406

Wget page title

Is it possible to Wget a page's title from the command line?

input:

$ wget http://bit.ly/rQyhG5 <<code>>

output:

If it’s broke, fix it right   - Keeping it Real Estate. Home

Upvotes: 2

Views: 7609

Answers (2)

PGillhaus

Reputation: 11

The following will pull whatever lynx thinks the title of the page is, saving you from all of the regex nonsense. Assuming the page you are retrieving is standards compliant enough for lynx, this should not break.

lynx -dump example.com | sed '2q;d'

Upvotes: 0

jfg956

Reputation: 16748

This script would give you what you need:

wget --quiet -O - http://bit.ly/rQyhG5 \
  | sed -n -e 's!.*<title>\(.*\)</title>.*!\1!p'

But there are lots of situations where it breaks, including if there is a <title>...</title> in the body of the page, or if the title is on more than one line.
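
Both failure modes are easy to reproduce without fetching anything. The following is a minimal sketch: the function name title_simple and the sample markup are made up for illustration, and the function simply feeds stdin to the same sed expression instead of using wget.

```shell
# Hypothetical helper: the one-liner above, reading HTML from stdin instead of wget.
title_simple() {
  sed -n -e 's!.*<title>\(.*\)</title>.*!\1!p'
}

# Title on a single line: extracted as expected.
printf '<head><title>Hello</title></head>\n' | title_simple

# Title split across two lines: no single line holds both tags,
# so the substitution never matches and nothing is printed.
printf '<head><title>Hello\nWorld</title></head>\n' | title_simple

# A second <title> element in the body: both lines match, so both are printed.
printf '%s\n' '<head><title>Real</title></head>' '<body><title>Stray</title></body>' \
  | title_simple
```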

This might be a little better:

wget --quiet -O - http://bit.ly/rQyhG5 \
  | paste -s -d " "  \
  | sed -e 's!.*<head>\(.*\)</head>.*!\1!' \
  | sed -e 's!.*<title>\(.*\)</title>.*!\1!'

but it does not fit your case as your page contains the following head opening:

<head profile="http://gmpg.org/xfn/11">

Again, this might be better:

wget --quiet -O - http://bit.ly/rQyhG5 \
  | paste -s -d " "  \
  | sed -e 's!.*<head[^>]*>\(.*\)</head>.*!\1!' \
  | sed -e 's!.*<title>\(.*\)</title>.*!\1!'

but there are still ways to break it, such as a page with no head or title: the substitutions then match nothing and sed prints the whole page unchanged.

Again, a better solution might be:

wget --quiet -O - http://bit.ly/rQyhG5 \
  | paste -s -d " "  \
  | sed -n -e 's!.*<head[^>]*>\(.*\)</head>.*!\1!p' \
  | sed -n -e 's!.*<title>\(.*\)</title>.*!\1!p'

but I am sure we can find a way to break it. This is why a true XML parser is the right solution, but as your question is tagged shell, the above is the best I can come up with.
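
To check how the last pipeline behaves without a network round trip, it can be wrapped in a function and fed sample HTML. This is only a sketch: the name title_from_stdin and the sample pages are assumptions made for illustration, and the explicit "-" operand to paste is added because some paste implementations appear to want a file argument.

```shell
# Hypothetical helper: the paste/sed pipeline above, fed from stdin instead of wget.
title_from_stdin() {
  # Join all lines with spaces, keep what is inside <head ...>...</head>,
  # then keep what is inside <title>...</title>. "-" means: read stdin.
  paste -s -d " " - \
    | sed -n -e 's!.*<head[^>]*>\(.*\)</head>.*!\1!p' \
    | sed -n -e 's!.*<title>\(.*\)</title>.*!\1!p'
}

# Head tag with an attribute and a title split over two lines.
printf '%s\n' '<html>' '<head profile="x">' '<title>My' 'Page</title>' '</head>' '</html>' \
  | title_from_stdin

# No head or title: -n and the p flag ensure that nothing is printed.
printf '%s\n' '<html><body>hi</body></html>' | title_from_stdin
```

Because paste joins the lines with spaces, the two-line title in the first sample comes out on a single line.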

The paste and the two sed commands can be merged into a single sed, at the cost of readability. However, this version has the advantage of working on multi-line titles:

wget --quiet -O - http://bit.ly/rQyhG5 \
  | sed -n -e 'H;${x;s!.*<head[^>]*>\(.*\)</head>.*!\1!;T;s!.*<title>\(.*\)</title>.*!\1!p}'

Update:

As explained in the comments, the last sed above uses the T command, which is a GNU extension. If you do not have a compatible version, you can use:

wget --quiet -O - http://bit.ly/rQyhG5 \
  | sed -n -e 'H;${x;s!.*<head[^>]*>\(.*\)</head>.*!\1!;tnext;b;:next;s!.*<title>\(.*\)</title>.*!\1!p}'

Update 2:

As the above still does not work on Mac, try:

wget --quiet -O - http://bit.ly/rQyhG5 \
  | sed -n -e 'H;${x;s!.*<head[^>]*>\(.*\)</head>.*!\1!;tnext};b;:next;s!.*<title>\(.*\)</title>.*!\1!p'

and/or

cat << EOF > script
H
\$x
\$s!.*<head[^>]*>\(.*\)</head>.*!\1!
\$tnext
b
:next
s!.*<title>\(.*\)</title>.*!\1!p
EOF
wget --quiet -O - http://bit.ly/rQyhG5 \
  | sed -n -f script

(Note the \ before the $ to avoid variable expansion.)

It seems that the :next label does not like to be prefixed by a $, which could be a problem in some sed versions.
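
The script-file variant can also be written with a quoted heredoc delimiter, which turns off variable expansion so that no backslash is needed before each $. The sketch below follows the same hold-space logic with one command per line and a comment on each step. The temporary file, the script_file variable and the sample input are assumptions made for illustration, and it has not been checked against every sed implementation.

```shell
# Hypothetical scratch file for the sed script; removed when the shell exits.
script_file=$(mktemp)
trap 'rm -f "$script_file"' EXIT

# Quoting the delimiter ('EOF') disables expansion, so $ needs no backslash.
cat << 'EOF' > "$script_file"
# Append every input line to the hold space.
H
# On the last line, swap the accumulated page into the pattern space.
$x
# Still on the last line, keep only the content of the head element.
$s!.*<head[^>]*>\(.*\)</head>.*!\1!
# If that substitution succeeded, jump to the title step.
$t next
# Otherwise (not the last line, or no head) end this cycle without output.
b
:next
# Print the content of the title element, if there is one.
s!.*<title>\(.*\)</title>.*!\1!p
EOF

# Sample page with a head attribute and a title split over two lines.
printf '%s\n' '<html>' '<head profile="x">' '<title>My' 'Page</title>' '</head>' '</html>' \
  | sed -n -f "$script_file"
```

Unlike the paste version, this one keeps the line break inside a multi-line title, so the sample prints the title on two lines.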

Upvotes: 12
