Reputation:
I'm trying to extract image links from the GoT wiki page. The first two links work fine, but the second two give me a 404 error code. I'm trying to find out what I'm doing wrong.
I've tried different combinations to come up with the right link.
import requests
from bs4 import BeautifulSoup
import urllib
import urllib.request as request
import re

url = 'https://en.wikipedia.org/w/index.php' + \
      '?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(url)
html_contents = r.text
soup = BeautifulSoup(html_contents, 'html.parser')

# Find all a tags in the soup
for a in soup.find_all('a'):
    # While looping through the text if you find img in 'a' tag
    # Then print the src attribute
    if a.img:
        print('http:/'+a.img['src'])
# And here are the images on the page
http:///upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png
http://static/images/wikimedia-button.png
http://static/images/poweredby_mediawiki_88x31.png
The first two links work
But I want to get the second two links to work as well.
Upvotes: 1
Views: 103
Reputation: 142661
These URLs start with a single /, so they have no domain, and you have to prepend https://en.wikipedia.org
to get full URLs like https://en.wikipedia.org/static/images/wikimedia-button.png
More or less:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

for a in soup.find_all('a'):
    if a.img:
        src = a.img['src']
        if src.startswith('http'):
            print(src)
        elif src.startswith('//'):
            print('https:' + src)
        elif src.startswith('/'):
            print('https://en.wikipedia.org' + src)
        else:
            print('https://en.wikipedia.org/w/' + src)
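To see why the original 'http:/' prefix goes wrong, it can help to look at which host each resulting URL actually points to. This is only a small diagnostic sketch: the helper name diagnose is made up for illustration, and the two src values are the ones behind the output in the question.

```python
from urllib.parse import urlsplit


def diagnose(src, prefix='http:/'):
    """Build a URL by naive prefixing (as in the question) and return (url, host)."""
    url = prefix + src
    return url, urlsplit(url).netloc


# src values as they appear in the page's HTML
samples = (
    '//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png',
    '/static/images/wikimedia-button.png',
)

for src in samples:
    url, host = diagnose(src)
    # An empty host or a host named "static" means the domain was lost
    print(repr(host), url)
```

The protocol-relative src ends up with three slashes and an empty host, while the root-relative src ends up with "static" treated as the host name, so neither URL names en.wikipedia.org.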
EDIT: you can also use urllib.parse.urljoin()
import requests
from bs4 import BeautifulSoup
import urllib.parse

url = 'https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

for a in soup.find_all('a'):
    if a.img:
        src = a.img['src']
        print(urllib.parse.urljoin('https://en.wikipedia.org', src))
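A quick offline check of how urljoin() treats each kind of src, with no network access needed. This is a sketch under assumptions: the helper name absolutize is hypothetical, the page URL is used as the base (so that plain relative paths resolve against /w/), and 'img.png' is a made-up sample value.

```python
from urllib.parse import urljoin

# Base URL of the page the src attributes were taken from
PAGE_URL = ('https://en.wikipedia.org/w/index.php'
            '?title=List_of_Game_of_Thrones_episodes&oldid=802553687')


def absolutize(src, base=PAGE_URL):
    """Resolve an img src against the page URL, the way a browser would."""
    return urljoin(base, src)


for src in (
    '//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png',
    '/static/images/wikimedia-button.png',
    'https://en.wikipedia.org/static/images/poweredby_mediawiki_88x31.png',
    'img.png',  # hypothetical relative path, for illustration only
):
    print(absolutize(src))
```

A src starting with // inherits the scheme of the base, a src starting with / keeps the base's host, and an already absolute URL is returned unchanged, so all the branches of the if/elif version collapse into one call.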
Upvotes: 0
Reputation:
Thanks for the help. I kept it simple. Here is what worked for me:
# Find all a tags in the soup
for a in soup.find_all('a'):
    # While looping through the text if you find img in 'a' tag
    # Then print the src attribute
    if a.img:
        if a.img['src'][:2] == '//':
            print('https:'+a.img['src'])
        else:
            print('https://en.wikipedia.org/'+a.img['src'])
# And here are the images on the page
Upvotes: 1