Reputation: 23
I'm working on a web scraper to download content from my school newspaper's website so it can be re-uploaded to our upcoming new website. Right now I'm testing how to download the images from a web page with bs4. However, as shown in my code below, I can't find the 'src' attribute of the image (i.e. its URL), which I need in order to download it.
import requests, bs4
url = 'https://www.behrendbeacon.com/parkingconcernsaddressed'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')  # name the parser explicitly
imgElems = soup.select('img')  # every <img> element on the page
print(imgElems[2])
# prints <img alt="18160.jpeg" data-type="image" id="comp-jpa6qz48imgimage"/>
So for further explanation:
1.) If you go to the URL and inspect the web page with the developer tools, you will see that imgElems[2] is the main image of the news article I'm trying to grab. Here's a screenshot to illustrate what I mean:
[screenshot of the web page]
2.) The reason I'm printing imgElems[2] is to demonstrate that Beautiful Soup doesn't return the 'src' attribute along with the rest of the element's data.
In short, can someone explain what I'm missing? Could the missing 'src' attribute have something to do with the website being a Wix site? Thank you for any help you can give.
Upvotes: 2
Views: 526
Reputation: 28565
It might just be that the page needs to render first because it's dynamic: requests doesn't execute JavaScript, so presumably the 'src' attribute is only filled in once the page's scripts have run. I believe the requests-html package can do that, although there seems to be a bug with it if you're trying to use it with Spyder, so I'm not too familiar with it. At some point I'll have to learn it and play around with it.
In the meantime, I've used Selenium to work with dynamic pages, and it worked for me on this one:
import bs4
from selenium import webdriver
url = 'https://www.behrendbeacon.com/parkingconcernsaddressed'
browser = webdriver.Chrome()
browser.get(url)
res = browser.page_source  # HTML after the browser has rendered the page
soup = bs4.BeautifulSoup(res, 'html.parser')
imgSrc = soup.find('img').get('src')  # 'src' of the first <img> on the page
print(imgSrc)
# prints https://static.wixstatic.com/media/7384a7_7bb56fcbcb6c48c0875c93a2b6c9821c~mv2.jpg/v1/fill/
# w_820,h_151,al_c,q_80,usm_0.66_1.00_0.01/7384a7_7bb56fcbcb6c48c0875c93a2b6c9821c~mv2.webp
browser.quit()  # close the browser and end the WebDriver session
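To see the difference between the two kinds of HTML without opening a browser, here is a minimal sketch that lists every image's 'src' in a chunk of markup. It uses only the standard library's html.parser rather than bs4, purely to stay dependency-free. The names ImgSrcCollector and collect_images are hypothetical helpers, and the rendered_html snippet with its /media/example.jpg path is a made-up placeholder, not the real markup from the site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcCollector(HTMLParser):
    """Collects (alt, src) for every <img> tag; src is None when absent."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != 'img':
            return
        attr_map = dict(attrs)
        src = attr_map.get('src')
        # Resolve relative URLs against the page URL; keep None if src is missing.
        resolved = urljoin(self.base_url, src) if src else None
        self.images.append((attr_map.get('alt'), resolved))


def collect_images(html, base_url):
    """Return a list of (alt, absolute src or None) for all <img> tags in html."""
    parser = ImgSrcCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.images


page_url = 'https://www.behrendbeacon.com/parkingconcernsaddressed'

# Markup as requests receives it (copied from the question's output): no src.
static_html = '<img alt="18160.jpeg" data-type="image" id="comp-jpa6qz48imgimage"/>'
# Hypothetical rendered markup; the src value is a made-up placeholder.
rendered_html = '<img alt="18160.jpeg" src="/media/example.jpg"/>'

print(collect_images(static_html, page_url))
print(collect_images(rendered_html, page_url))
```

With Selenium you would pass browser.page_source as the html argument. Any entry whose src is not None could then be fetched with requests.get(src) and its .content written to a file opened in binary mode.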
Upvotes: 3