Reputation: 113
I'm trying to scrape real estate data from this website: example. As you can see, the relevant content is placed in article tags.
I'm running Selenium with PhantomJS:
driver = webdriver.PhantomJS(executable_path=PJSpath)
Then I generate the URL in Python. All search parameters are part of the link, so I can search for what I'm looking for from the program without filling out the form.
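For illustration, here is a minimal sketch of how such a link could be assembled with the standard library. The parameter names and values are copied from the URL that appears in the answer further down; the helper name build_search_url and the variable search_params are hypothetical, not part of the original program.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.seloger.com/list.htm"


def build_search_url(params, base_url=BASE_URL):
    """Return base_url with params appended as a query string.

    doseq=True expands list values into repeated keys,
    e.g. idtypebien=2&idtypebien=1&idtypebien=11.
    """
    return base_url + "?" + urlencode(params, doseq=True)


# Values taken from the example URL in the answer; meanings are not documented here.
search_params = {
    "cp": 40250,
    "org": "advanced_search",
    "idtt": 2,
    "pxmin": 50000,
    "pxmax": 200000,
    "surfacemin": 20,
    "surfacemax": 100,
    "idtypebien": [2, 1, 11],
}

engine_link = build_search_url(search_params)
print(engine_link)
```

Building the query from a dict keeps the search criteria editable in one place and leaves the escaping to urlencode.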
Before calling
driver.get(engine_link)
I copy engine_link to the clipboard, and it opens fine in Chrome. Next, I wait for all possible redirects to happen:
def wait_for_redirect(wdriver):
    elem = wdriver.find_element_by_tag_name("html")
    count = 0
    while True:
        count += 1
        if count > 5:
            print("Waited for redirect for 5 seconds!")
            return
        time.sleep(1)
        try:
            elem = wdriver.find_element_by_tag_name("html")
        except StaleElementReferenceException:
            return
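As an aside, one alternative way to wait for redirects is to poll the driver's current_url until it stops changing. The sketch below is a heuristic under that assumption, not the approach used above: the name wait_for_stable_url and its parameters are hypothetical, and it only relies on the driver exposing a current_url attribute, as Selenium's WebDriver does.

```python
import time


def wait_for_stable_url(driver, timeout=5.0, poll_interval=0.5, sleep=time.sleep):
    """Poll driver.current_url until two consecutive reads match.

    Returns the last URL seen, either once it has stayed the same for one
    poll interval or after roughly `timeout` seconds have been spent polling.
    """
    max_polls = int(timeout / poll_interval)
    last_url = driver.current_url
    for _ in range(max_polls):
        sleep(poll_interval)
        current = driver.current_url
        if current == last_url:
            # No change since the previous read: treat the redirects as done.
            return current
        last_url = current
    return last_url
```

A redirect that takes longer than one poll interval to fire would slip past this check, so the interval may need tuning for a slow site.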
Now, at last, I want to iterate over all <article> tags on the current page:
for article in driver.find_elements_by_tag_name("article"):
But this loop never returns anything. The program doesn't find any article tags; I've also tried XPath and CSS selectors. Moreover, the articles are enclosed in a section tag, which can't be found either.
Is there a problem with this specific type of tag in Selenium, or am I missing something JavaScript-related here? At the bottom of the page there are JavaScript templates whose naming suggests that they generate the search results.
Any help appreciated!
Upvotes: 1
Views: 1792
Reputation: 474191
Pretend not to be PhantomJS and add an Explicit Wait (worked for me):
from selenium import webdriver
from selenium.webdriver import DesiredCapabilities
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# set a custom user-agent
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.57 Safari/537.36"
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = user_agent
driver = webdriver.PhantomJS(desired_capabilities=dcap)
driver.get("http://www.seloger.com/list.htm?cp=40250&org=advanced_search&idtt=2&pxmin=50000&pxmax=200000&surfacemin=20&surfacemax=100&idtypebien=2&idtypebien=1&idtypebien=11")
# wait for articles to be present
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.TAG_NAME, "article")))
# get articles
for article in driver.find_elements_by_tag_name("article"):
    print(article.text)
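To show what the explicit wait does conceptually, here is a simplified pure-Python sketch of the polling idea: call a condition repeatedly until it returns something truthy or a timeout expires. It is not Selenium's actual implementation, and wait_until is a hypothetical helper name.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without a truthy result.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(poll_interval)


# Possible usage with the driver from the snippet above (not executed here):
# articles = wait_until(lambda: driver.find_elements_by_tag_name("article"))
```

This is why the wait helps with JavaScript-rendered pages: the lookup is retried until the templates have produced the article elements, instead of running once against a page that is still empty.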
Upvotes: 1