bathtubandatoaster

Reputation: 67

Web scraping using Selenium, BeautifulSoup and Python

I am currently scraping a real estate website that uses JavaScript. My process starts by scraping a list containing many different href links to single listings, appending these links to another list, and then pressing the next button. I do this until the next button is no longer clickable.

My problem is that after collecting all the listings (~13,000 links), the scraper doesn't move on to the second part, where it opens the links and gets the info I need. Selenium doesn't even open the first element of the list of links.

Here's my code:

import bs4 as bs
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

houselinklist = []
wait = WebDriverWait(driver, 10)
while True:
    try:
        element = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, 'next')))
        html = driver.page_source
        soup = bs.BeautifulSoup(html,'html.parser')
        table = soup.find(id = 'search_main_div')
        classtitle =  table.find_all('p', class_= 'title')
        for aaa in classtitle:
            hrefsyo =  aaa.find('a', href = True)
            linkstoclick = hrefsyo.get('href')
            houselinklist.append(linkstoclick)
        element.click()
    except:
        pass

After this I have another simple scraper that goes through the list of listings, opens them in selenium and collects data on that listing.

for links in houselinklist:
    print(links)
    newwebpage = links
    driver.get(newwebpage)
    html = driver.page_source
    soup = bs.BeautifulSoup(html,'html.parser')
    # ... more code here

Upvotes: 0

Views: 548

Answers (1)

DJK

Reputation: 9274

The problem is that while True: creates a loop that runs forever. Your except clause has a pass statement, which means that once an error occurs (for example, when the 'next' button can no longer be found), the loop just continues to run. Instead it can be written as:

wait = WebDriverWait(driver, 10)
while True:
    try:
        element = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, 'next')))
        html = driver.page_source
        soup = bs.BeautifulSoup(html,'html.parser')
        table = soup.find(id = 'search_main_div')
        classtitle =  table.find_all('p', class_= 'title')
        for aaa in classtitle:
            hrefsyo =  aaa.find('a', href = True)
            linkstoclick = hrefsyo.get('href')
            houselinklist.append(linkstoclick)
        element.click()
    except:
        break # change this to exit loop

Once an error occurs, the loop will break and execution moves on to the next line of code.
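The control-flow difference can be seen in a selenium-free sketch (the names fetch_page and pages below are stand-ins, not part of the original scraper): once the "pages" run out, every further call raises, so an except clause with pass would spin forever, while break exits on the first failure.

```python
# Toy stand-in for the scraper loop: fetch_page() succeeds three times,
# then raises on every later call -- like a 'next' button that disappears.
calls = {"n": 0}

def fetch_page():
    calls["n"] += 1
    if calls["n"] > 3:
        raise RuntimeError("no more pages")
    return f"page {calls['n']}"

pages = []
while True:
    try:
        pages.append(fetch_page())
    except RuntimeError:
        break  # with `pass` here, the loop would never terminate

print(pages)  # the three pages collected before the failure
```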

Or you can eliminate the while loop and just loop over your list of href links with a for loop:

wait = WebDriverWait(driver, 10)
hrefLinks = ['link1','link2','link3'.....]
for link in hrefLinks:
    try:
        driver.get(link)
        element = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, 'next')))
        html = driver.page_source
        soup = bs.BeautifulSoup(html,'html.parser')
        table = soup.find(id = 'search_main_div')
        classtitle =  table.find_all('p', class_= 'title')
        for aaa in classtitle:
            hrefsyo =  aaa.find('a', href = True)
            linkstoclick = hrefsyo.get('href')
            houselinklist.append(linkstoclick)
        element.click()
    except:
        pass #pass on error and move on to next hreflink
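As a side note, a bare except: pass silently swallows every error, so you never learn which links failed or why. A small selenium-free sketch (scrape and links here are toy names, not the real scraper) shows one way to record failures while still moving on:

```python
# Toy version of the for-loop body: record failures instead of hiding
# them with a bare `except: pass`, so bad links are visible afterwards.
links = ["good-1", "bad", "good-2"]

def scrape(link):
    if link == "bad":
        raise ValueError(f"could not scrape {link}")
    return link.upper()

results, failed = [], []
for link in links:
    try:
        results.append(scrape(link))
    except ValueError as exc:
        failed.append((link, str(exc)))  # record the failure and move on

print(results)
print(failed)  # list of (link, reason) pairs to inspect later
```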

Upvotes: 1
