Misha Krul

Reputation: 397

Scraping each image from a craigslist search

I'm trying to pull each image url from a craigslist search, but can't seem to drill down to the URL itself. When I try soup.find_all("a", { "class":"result-image gallery"} )[0].img, it doesn't return anything.

Specifically, the page I am trying to scrape is https://raleigh.craigslist.org/search/rea?query=duplex&sort=date&availabilityMode=0&sale_date=all+dates.

I'm trying to get the image at the following src: https://images.craigslist.org/00j0j_cC4PhAMdHLj_300x300.jpg

The super frustrating thing is that I was able to successfully do this yesterday, but didn't commit that working code to Github at the time. I have since accidentally deleted it and can't figure out what I had originally done to make this work :(

Upvotes: 2

Views: 1131

Answers (4)

bharatk

Reputation: 4315

Try Selenium, a browser automation library. It lets you scrape pages whose content is rendered dynamically (by JavaScript or AJAX requests).

from selenium import webdriver
from bs4 import BeautifulSoup
import time
from bs4.element import Tag

driver = webdriver.Chrome('/usr/bin/chromedriver')
driver.get('https://raleigh.craigslist.org/search/rea?query=duplex&sort=date&availabilityMode=0&sale_date=all+dates')
time.sleep(3)

soup = BeautifulSoup(driver.page_source,'lxml')
rowArray = soup.find_all("li", { "class":"result-row"})

for row in rowArray:
    img = row.find("img")
    if img is None:
        continue
    if isinstance(img,Tag) and img.has_attr("src"):
        print(img['src'])
        print("----------------") 

Output:

https://images.craigslist.org/00U0U_azwRntzeNXr_300x300.jpg
----------------
https://images.craigslist.org/00101_h0xsGArMWPh_300x300.jpg
----------------
https://images.craigslist.org/00J0J_2EzptPF9ysn_300x300.jpg
----------------
https://images.craigslist.org/00101_2FiqAHsu509_300x300.jpg
----------------
https://images.craigslist.org/00D0D_jQbpUTsk6o3_300x300.jpg

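Here '/usr/bin/chromedriver' is the path to the ChromeDriver executable (newer Selenium 4 releases expect it to be passed via a Service object rather than as a positional argument).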

Download selenium web driver for chrome browser:

http://chromedriver.chromium.org/downloads

Install web driver for chrome browser:

https://christopher.su/2015/selenium-chromedriver-ubuntu/

Selenium tutorial:

https://selenium-python.readthedocs.io/

Upvotes: 2

David La Grange

Reputation: 413

    import requests
    from bs4 import BeautifulSoup

    r = requests.get("url here")
    soup = BeautifulSoup(r.content, features="html.parser")
    image_link = soup.find("div", { "class":"slide first visible"} ).img["src"]

You may have to change the tag name (div) and the class value if Craigslist changes its HTML layout.

Upvotes: 0

QHarr

Reputation: 84465

You only need requests and the landing page.

You can construct the image URLs from the ids on the page (and get all the images for each property).

The data-ids attribute provides a list of the ids for the associated images which you can use to construct each image url.

<a href="https://raleigh.craigslist.org/reo/d/rocky-mount-off-market-multifamily/6892616013.html" class="result-image gallery" data-ids="1:00j0j_cC4PhAMdHLj"><img alt="" class="" src="https://images.craigslist.org/00j0j_cC4PhAMdHLj_300x300.jpg">
    <span class="result-price">$99000</span>
</a>

from bs4 import BeautifulSoup as bs
import requests

image_url = 'https://images.craigslist.org/{}_300x300.jpg'
r = requests.get('https://raleigh.craigslist.org/search/rea?query=duplex&sort=date&availabilityMode=0&sale_date=all+dates')
soup = bs(r.content, 'lxml')
ids = [item['data-ids'].replace('1:','') for item in soup.select('.result-image[data-ids]')] 
images = [image_url.format(j) for i in ids for j in i.split(',')]
print(images)
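
If the numeric prefix in data-ids is ever something other than `1:`, the replace call above would leave it in place. A minimal sketch of a more defensive conversion, assuming each entry has the `prefix:id` shape shown in the sample markup and that entries are comma-separated; the helper name `image_urls_from_data_ids` is made up for illustration:

```python
IMAGE_URL = "https://images.craigslist.org/{}_300x300.jpg"


def image_urls_from_data_ids(data_ids):
    """Turn a data-ids value such as '1:abc,1:def' into thumbnail URLs."""
    urls = []
    for entry in data_ids.split(","):
        entry = entry.strip()
        if not entry:
            continue  # skip empty pieces, e.g. from an empty attribute
        # Keep only what follows the last colon; if there is no colon,
        # rpartition returns the whole entry as the third element.
        _, _, image_id = entry.rpartition(":")
        urls.append(IMAGE_URL.format(image_id))
    return urls


print(image_urls_from_data_ids("1:00j0j_cC4PhAMdHLj"))
```

For the sample anchor above this yields the same 300x300 URL as the img src in the markup.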

Upvotes: 1

mr.mams

Reputation: 462

It seems you are trying to get only the first image url. Therefore, you can just use find instead of find_all.

Also, to get the URL, you need to get the src attribute from img as well.

soup.find("a", { "class":"result-image gallery"} ).img["src"]
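
Note that find returns None when nothing matches, so the chained `.img["src"]` raises an error if the anchor or its image is missing from the HTML you received. A small sketch that guards against that, run here against the single-result markup quoted elsewhere in this thread rather than the live page; `first_image_src` and `SAMPLE_HTML` are illustrative names, not anything provided by Beautiful Soup:

```python
from bs4 import BeautifulSoup

# One search result copied from the page; a real response contains many.
SAMPLE_HTML = """
<a href="https://raleigh.craigslist.org/reo/d/rocky-mount-off-market-multifamily/6892616013.html" class="result-image gallery" data-ids="1:00j0j_cC4PhAMdHLj"><img alt="" class="" src="https://images.craigslist.org/00j0j_cC4PhAMdHLj_300x300.jpg">
    <span class="result-price">$99000</span>
</a>
"""


def first_image_src(html):
    """Return the src of the first result thumbnail, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches any <a> that has "result-image" among its classes.
    link = soup.find("a", class_="result-image")
    if link is None or link.img is None:
        return None
    return link.img.get("src")


print(first_image_src(SAMPLE_HTML))
print(first_image_src("<html><body></body></html>"))
```

The second call prints None instead of raising, which makes it easier to tell "selector is wrong" apart from "the image is not in the downloaded HTML at all".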

Upvotes: 0
