memi

Reputation: 13

Downloading images from pages doesn't work on some pages using BeautifulSoup

I'm trying to write a Python script to get the list of images from a page. It worked for every page except this page (the URL below), and I have no idea why.

from bs4 import BeautifulSoup
import requests
import os

def main(url):
    
    # content of URL
    r = requests.get(url)

    # Parse HTML Code
    soup = BeautifulSoup(r.text, 'html.parser')

    # find all images in URL
    images = soup.find_all('img')

# take url
url = "https://www.olx.com.eg/ad/-IDbWEaD.html"

# CALL MAIN FUNCTION
main(url)

When I traced the code, I found that the request always returns:

<Response [503]>

Upvotes: 0

Views: 66

Answers (1)

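You have to send a browser-like User-Agent header (see RFC 7231, section 5.5.3). By default, requests identifies itself as python-requests/<version>, which this site appears to reject with a 503.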

Please note the following point:

  1. Don't use try/except when you already have a condition you can match on; filtering directly is faster. Since you know that all the image links end with .jpg, match on that.

import requests
import re
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) Gecko/20100101 Firefox/87.0'
}


def main(url):
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    # Keep only <img> tags whose src ends with a literal ".jpg"
    print([x['src'] for x in soup.find_all('img', src=re.compile(r"\.jpg$"))])


main('https://www.olx.com.eg/en/ad/iphone-11-pro-max-256-99-IDbVsfw.html')

Output:

['https://apollo-ireland.akamaized.net/v1/files/d8ny85zq521f3-EG/image;s=861x156;olx-st/_2_.jpg', 'https://olxegstatic-a.akamaihd.net/c753187-12042/naspersclassifieds-regional/olxmena-atlas-web/static/img/takovr/en/700x500.jpg', 'https://apollo-ireland.akamaized.net/v1/files/fdx5st507obs-EG/image;s=644x461;olx-st/_1_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/9a8up5kipf3w3-EG/image;s=644x461;olx-st/_2_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/33gbpla9epxt3-EG/image;s=261x203;olx-st/_1_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/36phurpqbyd52-EG/image;s=261x203;olx-st/_3_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/t93mj7wc7tf2-EG/image;s=261x203;olx-st/_1_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/wyp080mr7p3d3-EG/image;s=261x203;olx-st/_2_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/oaqb85wgs8qf1-EG/image;s=261x203;olx-st/_1_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/fdx5st507obs-EG/image;s=261x203;olx-st/_1_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/9a8up5kipf3w3-EG/image;s=261x203;olx-st/_2_.jpg']
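A side note on the src filter used above: in a regular expression an unescaped dot matches any character, so only an escaped dot restricts the match to a literal ".jpg" suffix. The short check below is just a sketch to show the difference; the candidate strings are made up for illustration and do not come from the page.

```python
import re

loose = re.compile(".jpg$")      # "." matches any single character
strict = re.compile(r"\.jpg$")   # "\." matches only a literal dot

# Hypothetical src values, invented for this example.
candidates = ["photo.jpg", "photo_jpg", "photo.jpg?w=100"]

# BeautifulSoup applies a compiled pattern with search(), so do the same here.
loose_hits = [s for s in candidates if loose.search(s)]
strict_hits = [s for s in candidates if strict.search(s)]

print(loose_hits)   # ['photo.jpg', 'photo_jpg']
print(strict_hits)  # ['photo.jpg']
```

Neither pattern matches a src that carries a query string after the extension, so if a site serves such URLs the filter would need to be loosened accordingly.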

If you don't know the exact tags in advance, you can run a regex over the raw response text instead:

# r is the response returned by requests.get(url, headers=headers)
match = re.findall(r'\"(http.*?jpg)\"', r.text)
print(match)

Output:

['https://apollo-ireland.akamaized.net/v1/files/fdx5st507obs-EG/image;s=644x461;olx-st/_1_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/fdx5st507obs-EG/image;s=644x461;olx-st/_1_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/d8ny85zq521f3-EG/image;s=861x156;olx-st/_2_.jpg', 'https://olxegstatic-a.akamaihd.net/c753187-12042/naspersclassifieds-regional/olxmena-atlas-web/static/img/takovr/en/700x500.jpg', 'https://apollo-ireland.akamaized.net/v1/files/fdx5st507obs-EG/image;s=644x461;olx-st/_1_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/9a8up5kipf3w3-EG/image;s=644x461;olx-st/_2_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/33gbpla9epxt3-EG/image;s=261x203;olx-st/_1_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/36phurpqbyd52-EG/image;s=261x203;olx-st/_3_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/t93mj7wc7tf2-EG/image;s=261x203;olx-st/_1_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/wyp080mr7p3d3-EG/image;s=261x203;olx-st/_2_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/oaqb85wgs8qf1-EG/image;s=261x203;olx-st/_1_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/fdx5st507obs-EG/image;s=1000x700;olx-st/_1_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/fdx5st507obs-EG/image;s=94x72;olx-st/_1_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/9a8up5kipf3w3-EG/image;s=1000x700;olx-st/_2_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/9a8up5kipf3w3-EG/image;s=94x72;olx-st/_2_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/fdx5st507obs-EG/image;s=261x203;olx-st/_1_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/9a8up5kipf3w3-EG/image;s=261x203;olx-st/_2_.jpg']
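The regex output above contains the same URL several times, because the page references each image in more than one place. One possible way to drop the repeats while keeping the original order is sketched below. The helper name unique_jpg_urls and the sample HTML are assumptions made up for this example, and the pattern is a slightly stricter variant of the one above (it stays inside a single quoted value and requires a literal ".jpg"), not the exact pattern used in the answer.

```python
import re

# Hypothetical sample standing in for r.text; the URLs are made up.
sample_html = '''
<img src="https://example.com/a/_1_.jpg">
<a href="https://example.com/a/_1_.jpg">full size</a>
<img src="https://example.com/b/_2_.jpg">
<img src="https://example.com/logo.png">
'''


def unique_jpg_urls(text):
    """Return quoted http(s) URLs ending in .jpg, first occurrence only."""
    # [^"]* keeps each match inside one quoted attribute value.
    found = re.findall(r'"(https?://[^"]*\.jpg)"', text)
    # dict keys keep insertion order, so repeats are dropped in place.
    return list(dict.fromkeys(found))


print(unique_jpg_urls(sample_html))
# ['https://example.com/a/_1_.jpg', 'https://example.com/b/_2_.jpg']
```

With a real response you would pass r.text instead of sample_html.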

Upvotes: 3
