Reputation: 79
I am writing a bash script to download the current National Geographic Photo of the Day HTML page using wget; the page changes every day. When I go to https://www.nationalgeographic.com/photography/photo-of-the-day/ it redirects me to the current page, which today is https://www.nationalgeographic.com/photography/photo-of-the-day/2018/08/mandalay-golden-sunrise/, and the part after photo-of-the-day changes every day. I want wget to download the second (daily) page using only the first link, which redirects to it when typed in a browser. How can I do it?
So far I have tried:
wget https://www.nationalgeographic.com/photography/photo-of-the-day/
but it does not give me the HTML page at the second (redirected) link.
Upvotes: 1
Views: 905
Reputation: 3816
This will work for you, a nice and easy one-liner:
curl https://www.nationalgeographic.com/photography/photo-of-the-day/ | grep -m 1 https://www.nationalgeographic.com/photography/photo-of-the-day/ | cut -d '=' -f 3 |head -c-3 > desired_url
It writes the URL you are looking for to a file named desired_url. The file will look something like:
"https://www.nationalgeographic.com/photography/photo-of-the-day/2018/08/mandalay-golden-sunrise/"
which is your desired URL.
To download the page you just have to do:
url=`cat desired_url`
wget "$url"
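To see what the cut and head stages of the pipeline are doing, here is a self-contained sketch run on a sample meta line (the date and slug are placeholders, not today's page). Note that head -c -3 with a negative byte count requires GNU head:

```shell
#!/bin/sh
# Hypothetical sample of the line the pipeline matches on the page
line='<meta property="og:url" content="https://www.nationalgeographic.com/photography/photo-of-the-day/2018/08/mandalay-golden-sunrise/"/>'
# Splitting on '=' makes field 3 the quoted URL plus the trailing "/>"
printf '%s\n' "$line" | cut -d '=' -f 3
# head -c -3 (GNU head) drops the trailing "/>" and the newline,
# leaving the URL still wrapped in double quotes
printf '%s\n' "$line" | cut -d '=' -f 3 | head -c -3
```

This is why the resulting file still contains the surrounding double quotes; the quotes are harmless here because wget "$url" treats them as part of the URL only if you leave them in, so strip them with e.g. tr -d '"' if your wget complains.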
Upvotes: 2
Reputation: 1
If you strictly wish to use wget, you will have to download the page at the first URL to acquire the address that changes every day. Since we will not use the downloaded page for anything else, we can just download it to /tmp. I am naming the downloaded file NG.html:
wget https://www.nationalgeographic.com/photography/photo-of-the-day -O /tmp/NG.html
I assume that the URL you want is the direct link to the picture itself. How do we get that?
One way is to grep for the tag with "twitter:url" and print one line below it.
grep -A 1 twitter:url /tmp/NG.html
The "-A 1" parameter prints one more line after each line containing the pattern we searched for. The result looks like this:
grep -A 1 twitter:url /tmp/NG.html
<meta property="twitter:url" content="https://www.nationalgeographic.com/photography/photo-of-the-day/2018/08/mandalay-golden-sunrise/"/>
<meta property="og:image" content="https://yourshot.nationalgeographic.com/u/fQYSUbVfts-T7odkrFJckdiFeHvab0GWOfzhj7tYdC0uglagsDNcYRm8vejuXg0QxTzqdASwYhpl6e-h74GxPyqutLd15lrhO2QpHQIwDhQhoBQJTxpBSU4oz1-dHfqeGM_woeke6FIaD5wOrPsDo_UOe_nesId87TLVU8qeMyW07MHDznqt_vj5hZAtvQEpuBxw4bZQEeUoPC_zgoESthc9dS8cSTY2RA/"/>
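As a minimal, self-contained illustration of the -A flag (the URLs here are made-up placeholders, not the real page content), grep -A 1 emits the matched line plus the next line:

```shell
#!/bin/sh
# Two-line stand-in for the relevant part of NG.html (placeholder URLs)
printf '%s\n' \
  '<meta property="twitter:url" content="https://example.com/photo-page/"/>' \
  '<meta property="og:image" content="https://example.com/image.jpg"/>' \
  | grep -A 1 'twitter:url'
# Prints both lines: the match and one line of trailing context
```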
Now we can grep for "og:image" to select only the line that contains our URL. (We could not grep for "og:image" directly on the whole file, because the document contains other tags with "og:image" in them.) So now we get only the line containing the image URL:
grep -A 1 twitter:url /tmp/NG.html | grep "og:image"
<meta property="og:image" content="https://yourshot.nationalgeographic.com/u/fQYSUbVfts-T7odkrFJckdiFeHvab0GWOfzhj7tYdC0uglagsDNcYRm8vejuXg0QxTzqdASwYhpl6e-h74GxPyqutLd15lrhO2QpHQIwDhQhoBQJTxpBSU4oz1-dHfqeGM_woeke6FIaD5wOrPsDo_UOe_nesId87TLVU8qeMyW07MHDznqt_vj5hZAtvQEpuBxw4bZQEeUoPC_zgoESthc9dS8cSTY2RA/"/>
Now we can use cut to extract the URL from inside the HTML tag.
If we use the '"' character as the delimiter (separator), the 4th field is the URL:
1 <meta property=
2 og:image
3 content=
4 https://yourshot.nationalgeographic.com/u/fQYSUbVfts-T7odkrFJckdiFeHvab0GWOfzhj7tYdC0uglagsDNcYRm8vejuXg0QxTzqdASwYhpl6e-h74GxPyqutLd15lrhO2QpHQIwDhQhoBQJTxpBSU4oz1-dHfqeGM_woeke6FIaD5wOrPsDo_UOe_nesId87TLVU8qeMyW07MHDznqt_vj5hZAtvQEpuBxw4bZQEeUoPC_zgoESthc9dS8cSTY2RA/
5 />
So applying cut with delimiter '"' and selecting the 4th field gives us:
grep -A 1 twitter:url /tmp/NG.html | grep "og:image" | cut -d '"' -f 4
https://yourshot.nationalgeographic.com/u/fQYSUbVfts-T7odkrFJckdiFeHvab0GWOfzhj7tYdC0uglagsDNcYRm8vejuXg0QxTzqdASwYhpl6e-h74GxPyqutLd15lrhO2QpHQIwDhQhoBQJTxpBSU4oz1-dHfqeGM_woeke6FIaD5wOrPsDo_UOe_nesId87TLVU8qeMyW07MHDznqt_vj5hZAtvQEpuBxw4bZQEeUoPC_zgoESthc9dS8cSTY2RA/
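The field numbering can be verified on a self-contained sample line (the URL is shortened to a placeholder):

```shell
#!/bin/sh
# With '"' as the delimiter, the fields of a meta tag split as:
#   1: <meta property=   2: og:image   3:  content=
#   4: the URL           5: />
printf '%s\n' '<meta property="og:image" content="https://example.com/img.jpg"/>' \
  | cut -d '"' -f 4
# -> https://example.com/img.jpg
```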
Now we can supply this URL to wget and save the result as a JPG:
wget $( grep -A 1 twitter:url /tmp/NG.html | grep "og:image" | cut -d '"' -f 4) -O image.jpg
In summary, you need to run two lines:
wget https://www.nationalgeographic.com/photography/photo-of-the-day -O /tmp/NG.html
wget $( grep -A 1 twitter:url /tmp/NG.html | grep "og:image" | cut -d '"' -f 4) -O image.jpg
Upvotes: 0
Reputation: 23774
Try this:
#! /bin/bash
url=https://www.nationalgeographic.com/photography/photo-of-the-day/
wget -q -O- "$url" > index.html
og_url=$(xmllint --html --xpath 'string(//meta[@property="og:url"]/@content)' index.html 2>/dev/null)
og_image=$(xmllint --html --xpath 'string(//meta[@property="og:image"]/@content)' index.html 2>/dev/null)
rm index.html
name=${og_url%/}
name=${name##*/}
file="$name".jpg
wget -q -O "$file" "$og_image"
echo "$file"
First it loads the base URL. Then it uses xmllint to extract the relevant information. Standard error is ignored, because the HTML code contains many errors, but xmllint is still able to parse the relevant parts of the page. The name of the image is part of a URL, which is stored in the value of the content attribute of a meta element with the attribute property="og:url". The URL of the image itself is stored in a similar meta element with the attribute property="og:image". Bash's parameter substitution is used to craft a file name. File name and URL are then used in the second wget call to download the image. Finally the script reports the name of the created file.
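The two parameter expansions can be tried in isolation; the URL below is just a sample value of the og:url shape described above, not today's page:

```shell
#!/bin/bash
og_url="https://www.nationalgeographic.com/photography/photo-of-the-day/2018/08/mandalay-golden-sunrise/"
name=${og_url%/}    # remove the trailing slash
name=${name##*/}    # remove everything up to and including the last remaining slash
echo "$name".jpg    # -> mandalay-golden-sunrise.jpg
```

Both expansions are plain POSIX shell, so this part of the script would work unchanged under sh as well.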
Upvotes: 1