Reputation: 67
Currently scraping a real estate website that uses JavaScript. My process starts by scraping a page containing many different href links to single listings, appending these links to a list, and then pressing the next button. I do this until the next button is no longer clickable.
My problem is that after collecting all the listings (~13,000 links) the scraper doesn't move on to the second part, where it opens the links and gets the info I need. Selenium doesn't even open the first element of the list of links.
Here's my code:
wait = WebDriverWait(driver, 10)
while True:
    try:
        element = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, 'next')))
        html = driver.page_source
        soup = bs.BeautifulSoup(html, 'html.parser')
        table = soup.find(id='search_main_div')
        classtitle = table.find_all('p', class_='title')
        for aaa in classtitle:
            hrefsyo = aaa.find('a', href=True)
            linkstoclick = hrefsyo.get('href')
            houselinklist.append(linkstoclick)
        element.click()
    except:
        pass
After this I have another simple scraper that goes through the list of listings, opens each one in Selenium, and collects data on that listing.
for links in houselinklist:
    print(links)
    newwebpage = links
    driver.get(newwebpage)
    html = driver.page_source
    soup = bs.BeautifulSoup(html, 'html.parser')
    # ... more code here
Upvotes: 0
Views: 548
Reputation: 9274
The problem is that while True: creates a loop that runs forever. Your except clause has a pass statement, which means that once an error occurs, the loop just continues to run. Instead, it can be written as:
wait = WebDriverWait(driver, 10)
while True:
    try:
        element = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, 'next')))
        html = driver.page_source
        soup = bs.BeautifulSoup(html, 'html.parser')
        table = soup.find(id='search_main_div')
        classtitle = table.find_all('p', class_='title')
        for aaa in classtitle:
            hrefsyo = aaa.find('a', href=True)
            linkstoclick = hrefsyo.get('href')
            houselinklist.append(linkstoclick)
        element.click()
    except:
        break  # change this to exit the loop
Now, once an error occurs (for example, when the next button is no longer clickable and the wait times out), the loop will break and execution moves on to the next line of code.
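Here is a minimal, Selenium-free sketch of the difference: the "next button" is simulated by an iterator that eventually raises, and break ends the loop at that point instead of spinning forever.

```python
# Simulate paging: each next() is one "next button" click;
# StopIteration stands in for the button no longer being clickable.
pages = iter(['page1', 'page2', 'page3'])

collected = []
while True:
    try:
        collected.append(next(pages))  # raises StopIteration when exhausted
    except StopIteration:
        break  # exit the loop and move on to the code after it
print(collected)  # ['page1', 'page2', 'page3']
```

With pass instead of break, the loop would swallow the exception and run forever, which is exactly the hang described in the question.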
Alternatively, you can eliminate the while loop and just loop over your list of href links with a for loop:
wait = WebDriverWait(driver, 10)
hrefLinks = ['link1', 'link2', 'link3', ...]
for link in hrefLinks:
    try:
        driver.get(link)
        element = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, 'next')))
        html = driver.page_source
        soup = bs.BeautifulSoup(html, 'html.parser')
        table = soup.find(id='search_main_div')
        classtitle = table.find_all('p', class_='title')
        for aaa in classtitle:
            hrefsyo = aaa.find('a', href=True)
            linkstoclick = hrefsyo.get('href')
            houselinklist.append(linkstoclick)
        element.click()
    except:
        pass  # skip this link on error and move on to the next one
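One refinement worth considering: a bare except also hides real bugs (typos, attribute errors), not just flaky pages. A sketch of the skip-on-failure pattern with a specific exception type, using a hypothetical fetch helper in place of the actual driver.get and parsing code:

```python
# Hedged sketch: catch a specific exception so only expected failures
# (here, timeouts) are skipped, while genuine bugs still raise.
links = ['good1', 'bad', 'good2']

def fetch(link):
    # Stand-in for driver.get + parsing; 'bad' simulates a page timeout.
    if link == 'bad':
        raise TimeoutError(link)
    return link.upper()

results = []
for link in links:
    try:
        results.append(fetch(link))
    except TimeoutError:
        pass  # skip only the links that timed out
print(results)  # ['GOOD1', 'GOOD2']
```

In real Selenium code the equivalent would be catching selenium.common.exceptions.TimeoutException rather than TimeoutError.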
Upvotes: 1