Reputation: 21
I'm using BeautifulSoup to parse search results from Craigslist, but when I call find_all, I get an empty list as output. If anyone could point out where I'm making a mistake, or show me a better solution, I would be grateful!
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
url = "https://vancouver.craigslist.org/search/cta?query=tesla#search=1~gallery~0~0"
driver = webdriver.Chrome()
driver.get(url)
soup = BeautifulSoup(driver.page_source, "html.parser")
sleep(4)
results = soup.find("div", class_="results")
print(results)
parent = results.find("ol")
# I cannot access the <li> elements here
li_elements = parent.find_all("li", class_="cl-search-result cl-search-view-mode-gallery")
print(li_elements)
for li in li_elements:
    print("not empty")
    print(li.text)
Upvotes: 2
Views: 87
Reputation: 46
Selenium is a nice tool, but I hate using it for web scraping tasks, since it is slow to load and often gets blocked by Cloudflare.
What I personally do in such situations is try to find the API call that your browser makes to fetch the data. Just open your browser's developer tools, go to Network -> Fetch/XHR, and scan through the request URLs you find there. What you end up with is a bunch of raw JSON data that is so nice to work with.
So, as an example, this would give you a 15k-ish-line JSON file with 359 different Teslas:
import requests, json

# Call the same search endpoint the site itself uses and parse the JSON response
data = requests.get('https://sapi.craigslist.org/web/v8/postings/search/full?batch=16-0-360-0-0&cc=US&lang=en&query=tesla&searchPath=cta').json()
with open('craigs_data.json', 'w') as f:
    json.dump(data, f)  # save it so you can inspect the structure
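Since the structure of that JSON is not documented anywhere, it helps to get oriented before writing any parsing code. Continuing from the snippet above, a minimal way to peek at it:
# Inspect the undocumented response structure before parsing it
print(list(data.keys()))
print(json.dumps(data, indent=2)[:500])  # first 500 characters, pretty-printed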
What is even better, once you understand how the API URLs are constructed, you can build the links yourself. However, I must warn you:
APIs like this are not intended to be called directly by users! Don't request them too often, unless you want your IP banned by their servers!
For me personally, a 5-second pause between such requests is the gold standard. The longer, the better.
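As an illustration, here is a minimal sketch of building the request URL from its query parameters and pausing between calls. The parameter names are copied from the request above; the batch value is undocumented, so I treat it as an opaque string rather than constructing it myself:
import time
import requests

BASE = 'https://sapi.craigslist.org/web/v8/postings/search/full'
params = {
    'batch': '16-0-360-0-0',  # opaque, undocumented -- reuse it as-is from the browser request
    'cc': 'US',
    'lang': 'en',
    'query': 'tesla',
    'searchPath': 'cta',
}

for query in ('tesla', 'honda'):  # hypothetical example searches
    params['query'] = query
    data = requests.get(BASE, params=params).json()
    print(query, 'fetched')
    time.sleep(5)  # be polite: at least 5 seconds between requests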
P.S. Sorry, my answer has nothing to do with BeautifulSoup or Selenium, but since you asked for a better solution... :)
EDIT: The robots.txt file does not explicitly forbid calling their API, so technically you are allowed to use it.
Upvotes: 0
Reputation: 3056
As Driftr95 mentioned, you need to wait until the contents have loaded on the page before grabbing the page source. This website in particular is a bit slow to load at the start. In your code the sleep(4) comes after you have already read driver.page_source, so you are parsing an unfinished page; move the wait before the parsing step:
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep

url = "https://vancouver.craigslist.org/search/cta?query=tesla#search=1~gallery~0~0"
driver = webdriver.Chrome()
driver.get(url)
sleep(5)  # wait for the results to load BEFORE reading the page source

soup = BeautifulSoup(driver.page_source, "html.parser")
results = soup.find("div", class_="results")
parent = results.find("ol")
li_elements = parent.find_all("li", class_="cl-search-result cl-search-view-mode-gallery")
for li in li_elements:
    print(li.text)
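On top of that, a fixed sleep either wastes time or still fails on a slow load. If you want something more robust, Selenium's explicit waits block only until the results actually appear. A minimal sketch (the cl-search-result class name is taken from the question; the 15-second timeout is an arbitrary choice):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

url = "https://vancouver.craigslist.org/search/cta?query=tesla#search=1~gallery~0~0"
driver = webdriver.Chrome()
driver.get(url)

# Block until at least one search-result <li> is present (up to 15 s),
# instead of guessing with a fixed sleep
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "li.cl-search-result"))
)

soup = BeautifulSoup(driver.page_source, "html.parser")
for li in soup.find_all("li", class_="cl-search-result"):
    print(li.text)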
Upvotes: 0