Reputation: 475
I am scraping this webpage for personal use, https://asheville.craigslist.org/search/fua, and running into issues extracting the thumbnails of each item on the page. When I use "inspect" to view the HTML DOM I can see the image tags that contain the .jpg files I need, but when I use "view page source", the img tags don't show up. At first I thought this might be an asynchronous JavaScript loading issue, but I was told by a credible source that I should be able to scrape the thumbnails directly with BeautifulSoup.
import lxml
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

ua = UserAgent()
r = requests.get("https://asheville.craigslist.org/search/fua", params=dict(postal=28804), headers={"user-agent": ua.chrome})
soup = BeautifulSoup(r.content, "lxml")

for post in soup.find_all('li', "result-row"):
    for post_content in post.findAll("a", "result-image gallery"):
        print(post_content['href'])
        for pic in post_content.findAll("img", {'alt class': 'thumb'}):
            print(pic['src'])
Can someone clarify what I'm misunderstanding here? The value from the href attribute of the "a" tag will print, but I can't seem to get the src attribute of the "img" tag to print. Thanks in advance!
Upvotes: 1
Views: 1486
Reputation: 2559
I'm able to read the img tags with the following code:
for post in soup.find_all('li', "result-row"):
    for post_content in post.find_all("a", "result-image gallery"):
        print(post_content['href'])
        for pic in post_content.find_all("img"):
            print(pic['src'])
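The key difference from your version is the attribute filter. In BeautifulSoup, the keys of that dictionary are attribute names, so {'alt class': 'thumb'} looks for an attribute literally called "alt class", which no img tag has, and the inner loop matches nothing. If you do want to keep a filter, pass real attribute names instead. A minimal sketch of that inside your post_content loop, assuming the thumbnails really do carry a class of "thumb" (I haven't verified that against the page):

# The dictionary keys passed to find_all must be real attribute names.
# "thumb" as the class value is an assumption, not verified against the markup.
for pic in post_content.find_all("img", attrs={"class": "thumb"}):
    print(pic.get("src"))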
Just a few thoughts about scraping from craigslist:
Limit your requests per second. I have heard that craigslist will put a temporary block on your IP address if you exceed a certain request frequency.
Each post seemed to load between one and two images. On closer inspection, the carousel images are not loaded unless you click on the arrows. If you need every photo for each post, you should find a different way to write the script, possibly by visiting the link of each post that has multiple images; there's a rough sketch of that idea below.
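A minimal sketch of that per-post approach, reusing your requests/BeautifulSoup setup and sleeping between requests so you stay under whatever rate craigslist tolerates. The assumption that the photos on a post page sit in plain img tags is mine, so treat the selectors as a starting point:

import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

search = requests.get("https://asheville.craigslist.org/search/fua", params=dict(postal=28804))
soup = BeautifulSoup(search.content, "lxml")

for link in soup.find_all("a", "result-image gallery"):
    # urljoin handles both absolute and relative hrefs
    post_url = urljoin("https://asheville.craigslist.org/", link["href"])
    time.sleep(2)  # arbitrary delay to keep the request rate low
    post_page = requests.get(post_url)
    post_soup = BeautifulSoup(post_page.content, "lxml")
    # assumption: the full-size photos on a post page are plain img tags
    for img in post_soup.find_all("img"):
        src = img.get("src")
        if src:
            print(post_url, src)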
Also, I think Selenium is great for web scraping. You may not need it for this project, but it will let you do a lot more, like clicking on buttons, entering form data, etc. Here's the quick script I used to scrape the data with Selenium:
import lxml
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

def test():
    url = "https://asheville.craigslist.org/search/fua"
    driver = webdriver.Firefox()
    driver.get(url)
    html = driver.page_source.encode('utf-8')
    soup = BeautifulSoup(html, "lxml")
    for post in soup.find_all('li', "result-row"):
        for post_content in post.find_all("a", "result-image gallery"):
            print(post_content['href'])
            for pic in post_content.find_all("img"):
                print(pic['src'])

test()
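And since the carousel images only load when the arrows are clicked, Selenium can click through them before you hand the page source to BeautifulSoup. A rough sketch of that idea; the .slider-forward.arrow selector is an assumption about craigslist's markup and may need adjusting:

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://asheville.craigslist.org/search/fua")

# Click each gallery's forward arrow a few times so the carousel swaps in
# its extra images. The CSS selector below is a guess, not verified markup.
for arrow in driver.find_elements(By.CSS_SELECTOR, ".slider-forward.arrow"):
    for _ in range(3):
        try:
            arrow.click()
            time.sleep(0.2)  # give the carousel a moment to load the next image
        except WebDriverException:
            break  # posts with a single image may not have a clickable arrow

soup = BeautifulSoup(driver.page_source, "lxml")
print(len(soup.find_all("img")))  # count the img tags picked up after the clicks
driver.quit()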
Upvotes: 2