soria

Reputation: 87

How do you handle paging in Selenium?

I'm trying to build a crawler with Selenium. How can I handle the pagination?

The next and previous buttons in the HTML pagination have the following classes: the next button's class is review-next, and the previous button's class is review-prev.

When paging reaches the end (there is no element with the next-button class review-next), I want to go backward and continue the crawl from where I stopped, not from the first page.

Conversely, when there is no element with the previous-button class review-prev, the crawl should go forward again.

In other words, I want the paging to keep running back and forth.

Below is my code so far.

*Additional explanation.

First, if there is no next button (class review-next) on the current page, I want to go back to the previous page and crawl from there.

Even if that previous page has a next button (class review-next), the crawl should keep going backward from then on.

To sum up: once a missing next button (class review-next) has turned the crawl around, it keeps going backward even on pages that do have a next button.

<table>
    <tbody>
        <tr>
            <td class="num">512</td>
            <td class="thumb"><img src="test.jpg"></td>
            <td class="subject">
                <a href="/article/band/13538" id="re_href" class="re_href">Title</a>
            </td>
            <td class="writer"></td>
            <td class="check"></td>
        </tr>
        <tr>
            <td class="num">512</td>
            <td class="thumb"><img src="test2.jpg"></td>
            <td class="subject">
                <a href="/article/band/14230" id="re_href" class="re_href">Title</a>
            </td>
            <td class="writer"></td>
            <td class="check"></td>
        </tr>
        .
        .
        .
    </tbody>
</table>

<div class="base-paginate">
    <a href="?page=2" class="review-prev" title="prev-page"><img src="/btn_page_prev.gif" alt="prev-page"></a>
    <ol>
        <li><a href=""></a></li>
        <li><a href=""></a></li>
        <li><a href=""></a></li>
    </ol>
    <a href="?page=3" class="review-next" title="next-page"><img src="/btn_page_next.gif" alt="next-page"></a>
</div>

from time import sleep
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.set_page_load_timeout(60)

def close():
    driver.get('/test&page=1')

def start():
    driver.get('/test&page=1')
    sleep(2)

    list_of_links = []

    while True:

        list_of_links = driver.find_elements_by_xpath("//table//tr//td[@class='subject left txtBreak']/a")
        sleep(2)

        for linktext in range(len(list_of_links)):
            list_of_links = driver.find_elements_by_xpath("//table//tr//td[@class='subject left txtBreak']/a")
            element = list_of_links[linktext]
            driver.execute_script("arguments[0].click();", element)
            sleep(3)
            driver.back()
            sleep(3)

        try:
            driver.find_element_by_xpath("//a[@class='review-next']").click()

        except NoSuchElementException :
            break

    list_of_links = set(list_of_links)

    driver.close()

    return list_of_links

if __name__ == '__main__':
    list_of_links = start()

Upvotes: 0

Views: 116

Answers (1)

Misieq

Reputation: 545

If I understand you correctly, you are trying to go back through the pages once you hit a wall, so something like this (possibly with some edits) should work:

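    # Replacement for the loop inside start(); uses the same imports as the question.
    # Start by following the "next" button.
    type_of_button = "//a[@class='review-next']"
    while True:
        # Remember the current page so we can return to it if the button is missing.
        previous_url = driver.current_url
        list_of_links = driver.find_elements_by_xpath("//table//tr//td[@class='subject left txtBreak']/a")
        sleep(2)

        for linktext in range(len(list_of_links)):
            # Re-locate the links after every driver.back(); the old elements go stale.
            list_of_links = driver.find_elements_by_xpath("//table//tr//td[@class='subject left txtBreak']/a")
            element = list_of_links[linktext]
            driver.execute_script("arguments[0].click();", element)
            sleep(3)
            driver.back()
            sleep(3)

        try:
            driver.find_element_by_xpath(type_of_button).click()

        except NoSuchElementException:
            # No button in the current direction: reload this page and page backward from now on.
            driver.get(previous_url)
            type_of_button = "//a[@class='review-prev']"

    # Not reached until a break is added to the loop above.
    list_of_links = set(list_of_links)

    driver.close()

    return list_of_links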

Also, try not to use sleep. Read

https://selenium-python.readthedocs.io/waits.html

and use the explicit waits described there instead; fixed sleeps can cause a lot of bugs.
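To show what an explicit wait does differently from a fixed sleep, here is a minimal, library-free sketch of the polling idea. The names `wait_until`, `links_loaded`, and `attempts` are made up for illustration and are not part of Selenium; the fake `links_loaded` check stands in for a real page lookup.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Hypothetical stand-in for a page whose links only show up on the third check.
attempts = {"count": 0}


def links_loaded():
    attempts["count"] += 1
    return ["link-a", "link-b"] if attempts["count"] >= 3 else []


# Returns as soon as the links exist instead of always sleeping a fixed time.
links = wait_until(links_loaded, timeout=2.0, poll_interval=0.01)
```

In real code you would not write this yourself: Selenium's `WebDriverWait(driver, 10).until(...)` together with the conditions in `selenium.webdriver.support.expected_conditions` works on the same principle and is what the linked page describes.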

Also, right now there is no break point, so you need to add one to avoid an infinite loop.
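One possible way to get both the back-and-forth paging and a stopping point is to keep the direction logic separate from the browser calls. This is only a sketch under my own assumptions: `crawl_ping_pong`, `FakePager`, and the `max_pages` limit are hypothetical names, and the fake pager replaces the real browser so the logic can be checked without Selenium.

```python
def crawl_ping_pong(visit_page, go_next, go_prev, max_pages=20):
    """Page forward until go_next() fails, then backward until go_prev() fails, and so on.

    visit_page(): crawl the page currently shown.
    go_next() / go_prev(): try to move one page; return True on success,
        False when the corresponding button is missing.
    max_pages: hard limit on visited pages so the loop always terminates.
    """
    direction = "next"
    visited = 0
    while visited < max_pages:
        visit_page()
        visited += 1
        move = go_next if direction == "next" else go_prev
        if move():
            continue
        # Button missing: flip the direction and try once the other way.
        direction = "prev" if direction == "next" else "next"
        move = go_next if direction == "next" else go_prev
        if not move():
            break  # neither button exists, so there is only one page
    return visited


class FakePager:
    """Hypothetical stand-in for the browser: pages 1..last_page."""

    def __init__(self, last_page):
        self.page = 1
        self.last_page = last_page
        self.history = []

    def visit(self):
        self.history.append(self.page)

    def go_next(self):
        if self.page >= self.last_page:
            return False
        self.page += 1
        return True

    def go_prev(self):
        if self.page <= 1:
            return False
        self.page -= 1
        return True


pager = FakePager(last_page=3)
total = crawl_ping_pong(pager.visit, pager.go_next, pager.go_prev, max_pages=7)
```

With Selenium, `go_next` could look up `//a[@class='review-next']`, click it, and return True, or catch `NoSuchElementException` and return False; `go_prev` would do the same for `review-prev`. Note that newer Selenium 4 releases no longer have `find_element_by_xpath`, so there you would call `driver.find_element(By.XPATH, ...)` instead.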

Upvotes: 1
