Reputation: 21
I'm using BeautifulSoup to parse search results from Craigslist, but when I call find_all, I get an empty list as output. If anyone could point out where I'm making a mistake, or show me a better solution, I would be grateful!
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
url = "https://vancouver.craigslist.org/search/cta?query=tesla#search=1~gallery~0~0"
driver = webdriver.Chrome()
driver.get(url)
soup = BeautifulSoup(driver.page_source, "html.parser")
sleep(4)
results = soup.find("div", class_="results")
print(results)
parent = results.find("ol")
# I cannot access the <li> elements here
li_elements = parent.find_all("li", class_="cl-search-result cl-search-view-mode-gallery")
print(li_elements)
for li in li_elements:
    print("not empty")
    print(li.text)
Upvotes: 2
Views: 87
Reputation: 46
Selenium is a nice tool, but I hate using it for web scraping tasks, since it is slow to load and often gets blocked by Cloudflare.
What I personally do in such situations is try to find the API call that your browser makes to fetch the data. Just open your browser's developer tools, go to Network -> Fetch/XHR, and scan through the request URLs you find there. What you end up with is a bunch of raw JSON data that is so nice to work with.
So, as an example, this would give you a 15k-ish-line JSON file with 359 different Teslas:
import requests, json

# Call the same search endpoint the site itself uses and parse the JSON response
data = requests.get('https://sapi.craigslist.org/web/v8/postings/search/full?batch=16-0-360-0-0&cc=US&lang=en&query=tesla&searchPath=cta').json()
with open('craigs_data.json', 'w') as f:
    json.dump(data, f)  # save it so you can inspect the structure
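Since the structure of that JSON is not documented anywhere, it helps to get oriented before writing any parsing code. Continuing from the snippet above, a minimal way to peek at it:
# Inspect the undocumented response structure before parsing it
print(list(data.keys()))
print(json.dumps(data, indent=2)[:500])  # first 500 characters, pretty-printed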
What is even better, once you understand how the API URLs are constructed, you can build the links yourself. However, I must warn you:
APIs like this are not intended to be called directly by users! Don't request them too often, unless you want your IP banned by their servers!
For me personally, a 5-second pause between such requests is the gold standard. The longer, the better.
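As an illustration, here is a minimal sketch of building the request URL from its query parameters and pausing between calls. The parameter names are copied from the request above; the batch value is undocumented, so I treat it as an opaque string rather than constructing it myself:
import time
import requests

BASE = 'https://sapi.craigslist.org/web/v8/postings/search/full'
params = {
    'batch': '16-0-360-0-0',  # opaque, undocumented -- reuse it as-is from the browser request
    'cc': 'US',
    'lang': 'en',
    'query': 'tesla',
    'searchPath': 'cta',
}

for query in ('tesla', 'honda'):  # hypothetical example searches
    params['query'] = query
    data = requests.get(BASE, params=params).json()
    print(query, 'fetched')
    time.sleep(5)  # be polite: at least 5 seconds between requests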
P.S. Sorry, my answer has nothing to do with BeautifulSoup or Selenium, but since you asked for a better solution... :)
EDIT: The robots.txt file does not explicitly forbid calling their API, so technically you are allowed to use it.
Upvotes: 0
Reputation: 3056
As Driftr95 mentioned, you need to wait until the contents have loaded on the page before grabbing the page source. This website in particular is a bit slow to load at the start. In your code the sleep(4) comes after you have already read driver.page_source, so you are parsing an unfinished page; move the wait before the parsing step:
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep

url = "https://vancouver.craigslist.org/search/cta?query=tesla#search=1~gallery~0~0"
driver = webdriver.Chrome()
driver.get(url)
sleep(5)  # wait for the results to load BEFORE reading the page source

soup = BeautifulSoup(driver.page_source, "html.parser")
results = soup.find("div", class_="results")
parent = results.find("ol")
li_elements = parent.find_all("li", class_="cl-search-result cl-search-view-mode-gallery")
for li in li_elements:
    print(li.text)
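On top of that, a fixed sleep either wastes time or still fails on a slow load. If you want something more robust, Selenium's explicit waits block only until the results actually appear. A minimal sketch (the cl-search-result class name is taken from the question; the 15-second timeout is an arbitrary choice):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

url = "https://vancouver.craigslist.org/search/cta?query=tesla#search=1~gallery~0~0"
driver = webdriver.Chrome()
driver.get(url)

# Block until at least one search-result <li> is present (up to 15 s),
# instead of guessing with a fixed sleep
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "li.cl-search-result"))
)

soup = BeautifulSoup(driver.page_source, "html.parser")
for li in soup.find_all("li", class_="cl-search-result"):
    print(li.text)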
Upvotes: 0