Reputation: 397
I'm trying to pull each image URL from a Craigslist search, but can't seem to drill down to the URL itself. When I try soup.find_all("a", { "class":"result-image gallery"} )[0].img, it doesn't return anything.
Specifically, the page I am trying to scrape is https://raleigh.craigslist.org/search/rea?query=duplex&sort=date&availabilityMode=0&sale_date=all+dates.
I'm trying to get the image at the following src: https://images.craigslist.org/00j0j_cC4PhAMdHLj_300x300.jpg
The super frustrating thing is that I was able to do this successfully yesterday, but didn't commit that working code to GitHub at the time. I have since accidentally deleted it and can't figure out what I had originally done to make this work :(
Upvotes: 2
Views: 1131
Reputation: 4315
You should try Selenium, a browser automation library. It lets you scrape pages whose content is rendered dynamically (via JavaScript or AJAX requests).
from selenium import webdriver
from bs4 import BeautifulSoup
import time
from bs4.element import Tag

driver = webdriver.Chrome('/usr/bin/chromedriver')
driver.get('https://raleigh.craigslist.org/search/rea?query=duplex&sort=date&availabilityMode=0&sale_date=all+dates')
# give the page a few seconds to finish rendering
time.sleep(3)
soup = BeautifulSoup(driver.page_source,'lxml')
rowArray = soup.find_all("li", { "class":"result-row"})
for row in rowArray:
    img = row.find("img")
    # skip rows that have no image
    if img is None:
        continue
    if isinstance(img,Tag) and img.has_attr("src"):
        print(img['src'])
        print("----------------")
Output:
https://images.craigslist.org/00U0U_azwRntzeNXr_300x300.jpg
----------------
https://images.craigslist.org/00101_h0xsGArMWPh_300x300.jpg
----------------
https://images.craigslist.org/00J0J_2EzptPF9ysn_300x300.jpg
----------------
https://images.craigslist.org/00101_2FiqAHsu509_300x300.jpg
----------------
https://images.craigslist.org/00D0D_jQbpUTsk6o3_300x300.jpg
where '/usr/bin/chromedriver' is the path to the Selenium web driver (chromedriver).
Download selenium web driver for chrome browser:
http://chromedriver.chromium.org/downloads
Install web driver for chrome browser:
https://christopher.su/2015/selenium-chromedriver-ubuntu/
Selenium tutorial:
https://selenium-python.readthedocs.io/
Upvotes: 2
Reputation: 413
import requests
from bs4 import BeautifulSoup
r = requests.get("url here")
soup = BeautifulSoup(r.content, features="html.parser")
image_link = soup.find("div", { "class":"slide first visible"} ).img["src"]
You may have to change the tag type (div) and the class identifier if Craigslist changes the HTML layout.
Upvotes: 0
Reputation: 84465
You only need requests and the landing page.
You can construct the image URLs from the ids on the page (and get all the images for each property).
The data-ids attribute provides a list of the ids for the associated images, which you can use to construct each image URL.
<a href="https://raleigh.craigslist.org/reo/d/rocky-mount-off-market-multifamily/6892616013.html" class="result-image gallery" data-ids="1:00j0j_cC4PhAMdHLj"><img alt="" class="" src="https://images.craigslist.org/00j0j_cC4PhAMdHLj_300x300.jpg">
<span class="result-price">$99000</span>
</a>
from bs4 import BeautifulSoup as bs
import requests
image_url = 'https://images.craigslist.org/{}_300x300.jpg'
r = requests.get('https://raleigh.craigslist.org/search/rea?query=duplex&sort=date&availabilityMode=0&sale_date=all+dates')
soup = bs(r.content, 'lxml')
ids = [item['data-ids'].replace('1:','') for item in soup.select('.result-image[data-ids]')]
images = [image_url.format(j) for i in ids for j in i.split(',')]
print(images)
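The URL construction step can also be tried in isolation, without hitting the site. The following is only a sketch: the helper name image_urls_from_data_ids is made up for illustration, the two-id input in the usage line is an invented sample, and it assumes every entry in data-ids has the form prefix:id. It splits on the first colon instead of replacing the literal '1:', which would also cope with a different prefix, though whether Craigslist ever uses another prefix is an assumption.

```python
IMAGE_URL = "https://images.craigslist.org/{}_300x300.jpg"


def image_urls_from_data_ids(data_ids):
    """Build thumbnail URLs from a data-ids value such as '1:00j0j_cC4PhAMdHLj'."""
    urls = []
    for entry in data_ids.split(","):
        entry = entry.strip()
        if not entry:
            # ignore empty pieces, e.g. from an empty attribute value
            continue
        # keep only the part after the first ':' (assumed to be the image id)
        image_id = entry.split(":", 1)[-1]
        urls.append(IMAGE_URL.format(image_id))
    return urls


# Hypothetical attribute value holding two ids
print(image_urls_from_data_ids("1:00j0j_cC4PhAMdHLj,1:00U0U_azwRntzeNXr"))
```

In the answer's code this would be applied to item['data-ids'] for each element matched by soup.select('.result-image[data-ids]').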
Upvotes: 1
Reputation: 462
It seems you are trying to get only the first image URL, so you can just use find instead of find_all.
Also, to get the URL, you need to read the src attribute of the img tag as well.
soup.find("a", { "class":"result-image gallery"} ).img["src"]
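Note that if the matched anchor has no img child in the HTML you actually received, .img evaluates to None and indexing it with ["src"] raises a TypeError. Here is a minimal sketch of a guarded lookup, run against a small inline HTML string rather than the live page. The helper name first_image_src and the sample markup are illustrative assumptions, not taken from the Craigslist page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the first anchor has no <img> child,
# the second one does.
SAMPLE_HTML = """
<a class="result-image gallery" href="#" data-ids="1:00j0j_cC4PhAMdHLj"></a>
<a class="result-image gallery" href="#">
  <img src="https://images.craigslist.org/00j0j_cC4PhAMdHLj_300x300.jpg">
</a>
"""


def first_image_src(html):
    """Return the src of the first <img> found inside a matching anchor, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for anchor in soup.find_all("a", {"class": "result-image gallery"}):
        img = anchor.img  # None when the anchor has no <img> child
        if img is not None and img.has_attr("src"):
            return img["src"]
    return None


print(first_image_src(SAMPLE_HTML))
print(first_image_src('<a class="result-image gallery"></a>'))
```

The second call prints None instead of raising, which makes it easier to tell "no image in the markup" apart from a wrong selector.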
Upvotes: 0