Romal Jaiswal

Reputation: 79

download an html page using wget with only partial link

I am writing a bash script to download the current National Geographic Photo of the Day HTML page using wget; the page changes every day. When I open https://www.nationalgeographic.com/photography/photo-of-the-day/ in a browser, it redirects me to the current page, which today is https://www.nationalgeographic.com/photography/photo-of-the-day/2018/08/mandalay-golden-sunrise/. The part after photo-of-the-day changes every day on the website. I want wget to download the second HTML page (which changes daily) using only the first link (which, when typed into a browser, redirects me to the second one). How can I do it?

So far I have tried:

wget  https://www.nationalgeographic.com/photography/photo-of-the-day/ 

but it does not give me the HTML page of the second (redirected) URL.

Upvotes: 1

Views: 905

Answers (3)

Inder

Reputation: 3816

This will work for you; it is a simple one-liner.

curl https://www.nationalgeographic.com/photography/photo-of-the-day/ | grep -m 1 https://www.nationalgeographic.com/photography/photo-of-the-day/ | cut -d '=' -f 3 | head -c-3 | tr -d '"' > desired_url

It will write the URL you are looking for to a file named desired_url. (The trailing tr -d '"' strips the surrounding quotes that the cut/head steps leave behind, so the file contains a plain URL that wget can use.)

The file will contain something like:

https://www.nationalgeographic.com/photography/photo-of-the-day/2018/08/mandalay-golden-sunrise/

which is your desired URL.

To download the page you just have to run:

url=`cat desired_url`

wget "$url"
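The field counting and byte trimming above are fragile if the page markup shifts. A hedged alternative sketch (run on a sample meta line, not a live fetch) uses grep -o, which prints only the matching part of the line:

```shell
# Sample meta line standing in for the page output; grep -o prints only
# the matched text, so no cut/head arithmetic is needed. The [0-9] after
# photo-of-the-day/ requires the dated path, skipping the base URL itself.
line='<meta property="og:url" content="https://www.nationalgeographic.com/photography/photo-of-the-day/2018/08/mandalay-golden-sunrise/"/>'
url=$(printf '%s\n' "$line" | grep -o 'https://www\.nationalgeographic\.com/photography/photo-of-the-day/[0-9][^"]*')
echo "$url"
```

Against the live page you would feed the curl output into the same grep -o, with grep -m 1 to keep only the first match.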

Upvotes: 2

DMA

Reputation: 1

If you strictly wish to use wget, you will have to download the page at the first URL to acquire the address that changes every day. Since we will not use the downloaded page for anything else, we can download it to /tmp. I am saving the downloaded page as NG.html:

wget https://www.nationalgeographic.com/photography/photo-of-the-day -O /tmp/NG.html

I assume that the URL you want is the direct link to the picture, which in this case is:

https://yourshot.nationalgeographic.com/u/fQYSUbVfts-T7odkrFJckdiFeHvab0GWOfzhj7tYdC0uglagsDNcYRm8vejuXg0QxTzqdASwYhpl6e-h74GxPyqutLd15lrhO2QpHQIwDhQhoBQJTxpBSU4oz1-dHfqeGM_woeke6FIaD5wOrPsDo_UOe_nesId87TLVU8qeMyW07MHDznqt_vj5hZAtvQEpuBxw4bZQEeUoPC_zgoESthc9dS8cSTY2RA/

How do we get that?

One way is to grep for the tag with "twitter:url" and print one line below it.

grep -A 1 twitter:url  /tmp/NG.html

The "-A 1" parameter prints one more line after each line containing the pattern we searched for. The result looks like this:

 grep -A 1 twitter:url  /tmp/NG.html
<meta property="twitter:url" content="https://www.nationalgeographic.com/photography/photo-of-the-day/2018/08/mandalay-golden-sunrise/"/>
<meta property="og:image" content="https://yourshot.nationalgeographic.com/u/fQYSUbVfts-T7odkrFJckdiFeHvab0GWOfzhj7tYdC0uglagsDNcYRm8vejuXg0QxTzqdASwYhpl6e-h74GxPyqutLd15lrhO2QpHQIwDhQhoBQJTxpBSU4oz1-dHfqeGM_woeke6FIaD5wOrPsDo_UOe_nesId87TLVU8qeMyW07MHDznqt_vj5hZAtvQEpuBxw4bZQEeUoPC_zgoESthc9dS8cSTY2RA/"/>

Now we can grep for "og:image" to select only the line that contains our URL. We could not grep for "og:image" directly on the whole page, because other tags in the document also contain "og:image".

So now we will get only the last line containing the URL:

grep -A 1 twitter:url  /tmp/NG.html | grep "og:image"
<meta property="og:image" content="https://yourshot.nationalgeographic.com/u/fQYSUbVfts-T7odkrFJckdiFeHvab0GWOfzhj7tYdC0uglagsDNcYRm8vejuXg0QxTzqdASwYhpl6e-h74GxPyqutLd15lrhO2QpHQIwDhQhoBQJTxpBSU4oz1-dHfqeGM_woeke6FIaD5wOrPsDo_UOe_nesId87TLVU8qeMyW07MHDznqt_vj5hZAtvQEpuBxw4bZQEeUoPC_zgoESthc9dS8cSTY2RA/"/>

Now we can use cut to extract the URL from inside the HTML tag.

If we use the '"' character as the delimiter (separator), the 4th field will be the URL:

1 <meta property=
2 og:image
3 content=
4 https://yourshot.nationalgeographic.com/u/fQYSUbVfts-T7odkrFJckdiFeHvab0GWOfzhj7tYdC0uglagsDNcYRm8vejuXg0QxTzqdASwYhpl6e-h74GxPyqutLd15lrhO2QpHQIwDhQhoBQJTxpBSU4oz1-dHfqeGM_woeke6FIaD5wOrPsDo_UOe_nesId87TLVU8qeMyW07MHDznqt_vj5hZAtvQEpuBxw4bZQEeUoPC_zgoESthc9dS8cSTY2RA/
5 />

So now, applying cut with delimiter '"' and selecting the 4th field gives us:

 grep -A 1 twitter:url  /tmp/NG.html | grep "og:image" | cut -d '"' -f 4
https://yourshot.nationalgeographic.com/u/fQYSUbVfts-T7odkrFJckdiFeHvab0GWOfzhj7tYdC0uglagsDNcYRm8vejuXg0QxTzqdASwYhpl6e-h74GxPyqutLd15lrhO2QpHQIwDhQhoBQJTxpBSU4oz1-dHfqeGM_woeke6FIaD5wOrPsDo_UOe_nesId87TLVU8qeMyW07MHDznqt_vj5hZAtvQEpuBxw4bZQEeUoPC_zgoESthc9dS8cSTY2RA/

Now we can supply this URL to wget and save the image as a jpg:

wget $( grep -A 1 twitter:url  /tmp/NG.html | grep "og:image" | cut -d '"' -f 4) -O image.jpg

In summary, you will need to run two lines:

wget https://www.nationalgeographic.com/photography/photo-of-the-day -O /tmp/NG.html
wget $( grep -A 1 twitter:url  /tmp/NG.html | grep "og:image" | cut -d '"' -f 4) -O image.jpg
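As a variation, the two greps and the cut can be collapsed into a single sed substitution. This is a sketch on a sample line (the example.org URL is made up), not a live fetch:

```shell
# Sample og:image meta line (hypothetical URL) standing in for the real page.
line='<meta property="og:image" content="https://example.org/u/abc123/"/>'
# sed -n '.../p' prints only lines where the substitution matched;
# \1 captures the quoted value of the content attribute.
url=$(printf '%s\n' "$line" | sed -n 's/.*property="og:image" content="\([^"]*\)".*/\1/p')
echo "$url"
```

Against the real page you would replace the sample line with the contents of /tmp/NG.html; sed prints nothing for non-matching lines, so no preliminary grep is needed.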

Upvotes: 0

ceving

Reputation: 23774

Try this:

#! /bin/bash

url=https://www.nationalgeographic.com/photography/photo-of-the-day/

wget -q -O- "$url" > index.html

og_url=$(xmllint --html --xpath 'string(//meta[@property="og:url"]/@content)' index.html 2>/dev/null)
og_image=$(xmllint --html --xpath 'string(//meta[@property="og:image"]/@content)' index.html 2>/dev/null)

rm index.html

name=${og_url%/}
name=${name##*/}
file="$name".jpg

wget -q -O "$file" "$og_image"
echo "$file"

First it loads the base URL. Then it uses xmllint to extract the relevant information. Standard error is discarded, because the HTML code contains many errors, but xmllint is still able to parse the relevant parts of the page. The name of the image is part of a URL, which is stored in the content attribute of a meta element with the attribute property="og:url". The URL of the image is stored in a similar meta element with the attribute property="og:image". Bash's parameter substitution is used to craft a file name. The file name and the image URL are used in the second wget to load the image. Finally the script reports the name of the created file.
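The parameter-substitution step can be seen in isolation; given the og:url from the example above, the file name is derived like this:

```shell
# Derive the image file name from the page URL via bash parameter substitution.
og_url="https://www.nationalgeographic.com/photography/photo-of-the-day/2018/08/mandalay-golden-sunrise/"
name=${og_url%/}     # drop the trailing slash
name=${name##*/}     # keep only the last path component
file="$name".jpg
echo "$file"         # mandalay-golden-sunrise.jpg
```

`%/` removes the shortest suffix matching `/`, and `##*/` removes the longest prefix up to the last `/`, leaving just the slug for the jpg name.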

Upvotes: 1
