user7313804

Extracting image links using BeautifulSoup

I'm trying to extract image links from the GoT wiki page. The first two links work fine, but the last two give me a 404 error. I'm trying to find out what I'm doing wrong.

I've tried different combinations to come up with the right link.

import requests
from bs4 import BeautifulSoup
import urllib
import urllib.request as request
import re
url = 'https://en.wikipedia.org/w/index.php' + \
'?title=List_of_Game_of_Thrones_episodes&oldid=802553687'
r = requests.get(url)
html_contents = r.text
soup = BeautifulSoup(html_contents, 'html.parser')
# Find all a tags in the soup 
for a in soup.find_all('a'):
    # While looping through the text if you find img in 'a' tag
    # Then print the src attribute
    if a.img: 
        print('http:/'+a.img['src'])
# And here are the images on the page

http:///upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png

http:///upload.wikimedia.org/wikipedia/commons/thumb/2/2e/Game_of_Thrones_2011_logo.svg/300px-Game_of_Thrones_2011_logo.svg.png

http://static/images/wikimedia-button.png

http://static/images/poweredby_mediawiki_88x31.png

The first two links work, but I want to get the last two links to work as well.

Upvotes: 1

Views: 103

Answers (2)

furas

Reputation: 142661

These URLs start with a single /, so they have no domain. You have to prepend https://en.wikipedia.org to get full URLs such as https://en.wikipedia.org/static/images/wikimedia-button.png. URLs that start with // are missing only the scheme, so they need just https: in front.

More or less:

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687'

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

for a in soup.find_all('a'):
    if a.img:
        src = a.img['src']
        if src.startswith('http'):
            print(src)
        elif src.startswith('//'):
            print('https:' + src)
        elif src.startswith('/'):
            print('https://en.wikipedia.org' + src)
        else:
            print('https://en.wikipedia.org/w/' + src)

EDIT: You can also use urllib.parse.urljoin(), which handles all of these cases in one call:

import requests
from bs4 import BeautifulSoup
import urllib.parse

url = 'https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687'

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

for a in soup.find_all('a'):
    if a.img:
        src = a.img['src']
        # Resolve against the page URL so path-relative src values also land under /w/
        print(urllib.parse.urljoin(url, src))
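
To see what urljoin() does without hitting the network, here is a minimal sketch that resolves a few src values against the page URL. The first two src values are the ones implied by the output in the question; 'logo.png' is a made-up, path-relative value added only to illustrate that case, and the names page_url, srcs and resolved are just for this example.

```python
from urllib.parse import urljoin

# Page the src values were scraped from (taken from the question).
page_url = ('https://en.wikipedia.org/w/index.php'
            '?title=List_of_Game_of_Thrones_episodes&oldid=802553687')

srcs = [
    # Scheme-relative: only the scheme is missing.
    '//upload.wikimedia.org/wikipedia/en/thumb/e/e7/Cscr-featured.svg/20px-Cscr-featured.svg.png',
    # Root-relative: scheme and host are missing.
    '/static/images/wikimedia-button.png',
    # Hypothetical path-relative value, not taken from the page.
    'logo.png',
]

# urljoin fills in whatever parts of the URL the src leaves out.
resolved = [urljoin(page_url, src) for src in srcs]

for full_url in resolved:
    print(full_url)
```

The scheme-relative value keeps its own host (upload.wikimedia.org) and only gains https:, the root-relative value gets https://en.wikipedia.org prepended, and the path-relative value is resolved next to index.php, i.e. under /w/.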

Upvotes: 0

user7313804

Thanks for the help. I kept it simple. Here is what worked for me:

# Find all a tags in the soup 
for a in soup.find_all('a'):
    # While looping through the text if you find img in 'a' tag
    # Then print the src attribute
    if a.img:
        if a.img['src'][:2] == '//':
            print('https:'+a.img['src'])
        else:
            # src already starts with '/', so the base must not end with one
            print('https://en.wikipedia.org'+a.img['src'])
# And here are the images on the page

Upvotes: 1
