Extracting image links using BeautifulSoup

Question

I'm trying to extract image links from the GoT wiki page. The first two links work fine, but the last two give me a 404 error. I'm trying to find out what I'm doing wrong.

I've tried different combinations to come up with the right link.

import requests
from bs4 import BeautifulSoup
import urllib
import urllib.request as request
import re

url = 'https://en.wikipedia.org/w/index.php' + \
'?title=List_of_Game_of_Thrones_episodes&oldid=802553687'

r = requests.get(url)
html_contents = r.text
soup = BeautifulSoup(html_contents, 'html.parser')

# Find all a tags in the soup 
for a in soup.find_all('a'):
    # While looping through the text if you find img in 'a' tag
    # Then print the src attribute
    if a.img: 
        print('http:/'+a.img['src'])
# And here are the images on the page

http:///upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png

http:///upload.wikimedia.org/wikipedia/commons/thumb/2/2e/Game_of_Thrones_2011_logo.svg/300px-Game_of_Thrones_2011_logo.svg.png

http://static/images/wikimedia-button.png

http://static/images/poweredby_mediawiki_88x31.png

The first two links work, but I want the last two links to work as well.

furas · Accepted Answer

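The last two URLs start with a single /, so they are relative to the site root and have no domain. You have to prepend https://en.wikipedia.org to get full URLs like https://en.wikipedia.org/static/images/wikimedia-button.png. The first two start with //, so they already contain the domain and only need a scheme (https:) in front.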

More or less:

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687'

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

for a in soup.find_all('a'):
    if a.img:
        src = a.img['src']
        if src.startswith('http'):
            print(src)
        elif src.startswith('//'):
            print('https:' + src)
        elif src.startswith('/'):
            print('https://en.wikipedia.org' + src)
        else:
            print('https://en.wikipedia.org/w/' + src)

EDIT: You can also use urllib.parse.urljoin(), which resolves both the // and the / prefixes:

import requests
from bs4 import BeautifulSoup
import urllib.parse

url = 'https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687'

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

for a in soup.find_all('a'):
    if a.img:
        src = a.img['src']
        # Resolve src against the page URL so plain relative paths work too
        print(urllib.parse.urljoin(url, src))
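
To check how urljoin() behaves without fetching the page, you can feed it the src values directly. This is only a minimal sketch: the src strings are reconstructed from the output shown in the question, the page URL is assumed as the base, and resolve() is a hypothetical helper name used for illustration.

```python
from urllib.parse import urljoin

# Base is the page the <img> tags came from (URL taken from the question).
BASE = ('https://en.wikipedia.org/w/index.php'
        '?title=List_of_Game_of_Thrones_episodes&oldid=802553687')


def resolve(src, base=BASE):
    """Hypothetical helper: turn an <img> src value into an absolute URL."""
    return urljoin(base, src)


# src values reconstructed from the question's output (assumption):
# one protocol-relative (//...) and one root-relative (/...).
srcs = [
    '//upload.wikimedia.org/wikipedia/en/thumb/e/e7/'
    'Cscr-featured.svg/20px-Cscr-featured.svg.png',
    '/static/images/wikimedia-button.png',
]

for src in srcs:
    print(resolve(src))
```

The // value picks up only the scheme from the base, while the / value picks up both the scheme and the domain, so neither case needs its own startswith() branch.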
