Reputation: 13
I'm trying to write a Python script to get the list of images from a page. It worked for every page except this one (the URL in the code below), and I have no idea why.
from bs4 import BeautifulSoup
import requests
import os

def main(url):
    # content of URL
    r = requests.get(url)
    # Parse HTML Code
    soup = BeautifulSoup(r.text, 'html.parser')
    # find all images in URL
    images = soup.find_all('img')

# take url
url = "https://www.olx.com.eg/ad/-IDbWEaD.html"
# CALL MAIN FUNCTION
main(url)
When I traced the code, I found that the request always returns:
<Response [503]>
Upvotes: 0
Views: 66
Reputation: 11515
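You have to send a User-Agent header with the request (the header is defined in RFC 7231). Without one, requests identifies itself as python-requests, which this site apparently rejects with a 503.
Please note the following point: the code below only keeps img tags whose src ends in .jpg.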
import requests
import re
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) Gecko/20100101 Firefox/87.0'
}

def main(url):
    # send a browser-like User-Agent with the request
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'lxml')
    # keep only <img> tags whose src ends with .jpg (dot escaped so it matches literally)
    print([x['src'] for x in soup.find_all('img', src=re.compile(r"\.jpg$"))])

main('https://www.olx.com.eg/en/ad/iphone-11-pro-max-256-99-IDbVsfw.html')
Output:
['https://apollo-ireland.akamaized.net/v1/files/d8ny85zq521f3-EG/image;s=861x156;olx-st/_2_.jpg', 'https://olxegstatic-a.akamaihd.net/c753187-12042/naspersclassifieds-regional/olxmena-atlas-web/static/img/takovr/en/700x500.jpg', 'https://apollo-ireland.akamaized.net/v1/files/fdx5st507obs-EG/image;s=644x461;olx-st/_1_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/9a8up5kipf3w3-EG/image;s=644x461;olx-st/_2_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/33gbpla9epxt3-EG/image;s=261x203;olx-st/_1_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/36phurpqbyd52-EG/image;s=261x203;olx-st/_3_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/t93mj7wc7tf2-EG/image;s=261x203;olx-st/_1_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/wyp080mr7p3d3-EG/image;s=261x203;olx-st/_2_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/oaqb85wgs8qf1-EG/image;s=261x203;olx-st/_1_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/fdx5st507obs-EG/image;s=261x203;olx-st/_1_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/9a8up5kipf3w3-EG/image;s=261x203;olx-st/_2_.jpg']
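A note on why the header matters: when no User-Agent is set, requests sends its own default of the form python-requests/<version>. The sketch below is one possible way to set the header once on a session and to fail loudly on a 503 instead of parsing an error page. `build_session` and `fetch_html` are hypothetical helper names, not part of requests, and the header value is simply the one used above.

```python
import requests

# Assumption: any browser-like User-Agent works; this is the one from the answer.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) Gecko/20100101 Firefox/87.0"
}


def build_session(headers=BROWSER_HEADERS):
    """Return a Session that sends the given headers on every request (hypothetical helper)."""
    session = requests.Session()
    session.headers.update(headers)
    return session


def fetch_html(session, url, timeout=10):
    """Fetch a page and raise on 4xx/5xx responses (hypothetical helper)."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # a 503 raises requests.HTTPError here
    return response.text


# What requests would send if no User-Agent were set explicitly.
default_ua = requests.utils.default_user_agent()
print(default_ua)

# The session overrides that default for every request made through it.
session = build_session()
print(session.headers["User-Agent"])
```

Calling `fetch_html(session, url)` with the ad URL would then either return the page HTML or raise, which makes a blocked request visible immediately.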
If you don't know the exact tags, you can fall back to a regex over the raw response text, as below:
# r is the response returned by requests.get(url, headers=headers)
match = re.findall(r'\"(http.*?jpg)\"', r.text)
print(match)
Output:
['https://apollo-ireland.akamaized.net/v1/files/fdx5st507obs-EG/image;s=644x461;olx-st/_1_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/fdx5st507obs-EG/image;s=644x461;olx-st/_1_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/d8ny85zq521f3-EG/image;s=861x156;olx-st/_2_.jpg', 'https://olxegstatic-a.akamaihd.net/c753187-12042/naspersclassifieds-regional/olxmena-atlas-web/static/img/takovr/en/700x500.jpg', 'https://apollo-ireland.akamaized.net/v1/files/fdx5st507obs-EG/image;s=644x461;olx-st/_1_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/9a8up5kipf3w3-EG/image;s=644x461;olx-st/_2_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/33gbpla9epxt3-EG/image;s=261x203;olx-st/_1_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/36phurpqbyd52-EG/image;s=261x203;olx-st/_3_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/t93mj7wc7tf2-EG/image;s=261x203;olx-st/_1_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/wyp080mr7p3d3-EG/image;s=261x203;olx-st/_2_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/oaqb85wgs8qf1-EG/image;s=261x203;olx-st/_1_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/fdx5st507obs-EG/image;s=1000x700;olx-st/_1_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/fdx5st507obs-EG/image;s=94x72;olx-st/_1_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/9a8up5kipf3w3-EG/image;s=1000x700;olx-st/_2_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/9a8up5kipf3w3-EG/image;s=94x72;olx-st/_2_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/fdx5st507obs-EG/image;s=261x203;olx-st/_1_.jpg', 'https://apollo-ireland.akamaized.net/v1/files/9a8up5kipf3w3-EG/image;s=261x203;olx-st/_2_.jpg']
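The regex output above contains the same URL more than once, because a URL can appear in several attributes and scripts on the page. A minimal sketch of removing the repeats while keeping first-seen order follows. `sample_html` is a made-up stand-in for `r.text`, `unique_jpg_urls` is a hypothetical helper, and the pattern is a slightly stricter variant of the one above (it refuses to cross a double quote), not the answer's original.

```python
import re

# Hypothetical stand-in for r.text; the real page is much larger.
sample_html = (
    '<img src="https://example.com/a/_1_.jpg">'
    '<a href="https://example.com/a/_1_.jpg">zoom</a>'
    '<img src="https://example.com/b/_2_.jpg">'
    '<script>{"logo": "https://example.com/logo.png"}</script>'
)


def unique_jpg_urls(html):
    """Return quoted http(s) URLs ending in .jpg, without repeats, in first-seen order."""
    # [^"]+? keeps each match inside a single quoted string.
    matches = re.findall(r'"(https?://[^"]+?\.jpg)"', html)
    # dict keys preserve insertion order, so this drops repeats but keeps order.
    return list(dict.fromkeys(matches))


urls = unique_jpg_urls(sample_html)
print(urls)
```

Using `dict.fromkeys` rather than `set` is a deliberate choice here: a set would also remove the repeats but would not keep the order in which the images appear on the page.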
Upvotes: 3