Reputation: 87
I'm trying to build a crawler with Selenium. How can I handle the paging part?
The next and previous buttons in the HTML pagination have the following classes:
the next button's class is review-next, and the previous button's class is review-prev.
When paging reaches the end (that is, when there is no next button with class review-next), I want to go backwards and continue the crawl from where I stopped, not from the first page.
Conversely, when there is no previous button (class review-prev), the crawl should go forward again.
In other words, I want the paging to keep running back and forth.
Below is my code so far.
*Additional explanation:
First, if there is no next button (class review-next) on the current page, I want to go back to the previous page and crawl it.
Even if that previous page has a next button (class review-next), the crawl should keep going backwards from then on.
To sum up: once a missing next button (class review-next) has turned the crawl around, it keeps going backwards even on pages that do have a next button.
<table>
<tbody>
<tr>
<td class="num">512</td>
<td class="thumb"><img src="test.jpg"></td>
<td class="subject">
<a href="/article/band/13538" id="re_href" class="re_href">Title</a>
</td>
<td class="writer"></td>
<td class="check"></td>
</tr>
<tr>
<td class="num">512</td>
<td class="thumb"><img src="test2.jpg"></td>
<td class="subject">
<a href="/article/band/14230" id="re_href" class="re_href">Title</a>
</td>
<td class="writer"></td>
<td class="check"></td>
</tr>
.
.
.
</tbody>
</table>
<div class="base-paginate">
<a href="?page=2" class="review-prev" title="prev-page"><img src="/btn_page_prev.gif" alt="prev-page"></a>
<ol>
<li><a href=""></a></li>
<li><a href=""></a></li>
<li><a href=""></a></li>
</ol>
<a href="?page=3" class="review-next" title="next-page"><img src="/btn_page_next.gif" alt="next-page"></a>
</div>
from time import sleep

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.set_page_load_timeout(60)


def close():
    driver.get('/test&page=1')


def start():
    driver.get('/test&page=1')
    sleep(2)
    list_of_links = []
    while True:
        list_of_links = driver.find_elements_by_xpath("//table//tr//td[@class='subject left txtBreak']/a")
        sleep(2)
        for linktext in range(len(list_of_links)):
            # Re-query the links on every pass: elements found before driver.back() go stale.
            list_of_links = driver.find_elements_by_xpath("//table//tr//td[@class='subject left txtBreak']/a")
            element = list_of_links[linktext]
            driver.execute_script("arguments[0].click();", element)
            sleep(3)
            driver.back()
            sleep(3)
        # Move to the next page; stop when there is no next button.
        try:
            driver.find_element_by_xpath("//a[@class='review-next']").click()
        except NoSuchElementException:
            break
    list_of_links = set(list_of_links)
    driver.close()
    return list_of_links


if __name__ == '__main__':
    list_of_links = start()
Upvotes: 0
Views: 116
Reputation: 545
If I understand you correctly, you are trying to go back through the pages once you hit a wall, so something like this (possibly with some edits) should work:
    # Fragment: replaces the while loop inside start().
    type_of_button = "//a[@class='review-next']"
    while True:
        previous_url = driver.current_url
        list_of_links = driver.find_elements_by_xpath("//table//tr//td[@class='subject left txtBreak']/a")
        sleep(2)
        for linktext in range(len(list_of_links)):
            list_of_links = driver.find_elements_by_xpath("//table//tr//td[@class='subject left txtBreak']/a")
            element = list_of_links[linktext]
            driver.execute_script("arguments[0].click();", element)
            sleep(3)
            driver.back()
            sleep(3)
        try:
            driver.find_element_by_xpath(type_of_button).click()
        except NoSuchElementException:
            # No next button: reload the current page and page backwards from now on.
            driver.get(previous_url)
            type_of_button = "//a[@class='review-prev']"
    list_of_links = set(list_of_links)
    driver.close()
    return list_of_links
Also, try not to use sleep. Read up on the alternatives
and implement them; the sleep method can create a lot of bugs.
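One way to act on the advice about sleep is to poll for a condition with a timeout instead of pausing for a fixed time. The sketch below is a plain-Python illustration of that idea only. The name wait_until and its parameters are hypothetical and not part of Selenium. Selenium ships its own version of this pattern as WebDriverWait together with the expected_conditions module, which is probably the better choice in a real crawler.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until(
    condition: Callable[[], Optional[T]],
    timeout: float = 10.0,
    poll: float = 0.5,
) -> T:
    """Call condition repeatedly until it returns a truthy value.

    Raises TimeoutError if nothing truthy comes back within timeout seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        # Short pause between checks instead of one long fixed sleep.
        time.sleep(poll)
```

In the crawler, the condition could be something like `lambda: driver.find_elements_by_xpath("//a[@class='review-next']")` in place of a `sleep(3)` before the click, assuming the same driver and XPath as in the code above. The call then returns as soon as the element appears instead of always waiting the full three seconds.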
Also, there is currently no break condition, so you need to add one to avoid an infinite loop.
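The back-and-forth paging plus a stopping rule can be sketched independently of the browser. This is a minimal sketch under assumptions: ping_pong_pages, its callback parameters, max_visits, and the FakePager stand-in are all hypothetical names introduced here for illustration, and capping the number of page visits is only one possible break condition.

```python
from typing import Callable, List


def ping_pong_pages(
    has_next: Callable[[], bool],
    has_prev: Callable[[], bool],
    go_next: Callable[[], None],
    go_prev: Callable[[], None],
    scrape: Callable[[], None],
    max_visits: int = 100,
) -> int:
    """Scrape pages back and forth, reversing direction at either end.

    Stops after max_visits page visits (the break condition) and
    returns the number of pages actually visited.
    """
    forward = True
    visits = 0
    while True:
        scrape()
        visits += 1
        if visits >= max_visits:
            break
        can_move = has_next() if forward else has_prev()
        if not can_move:
            # Hit the end in this direction: turn around and keep going.
            forward = not forward
            can_move = has_next() if forward else has_prev()
            if not can_move:
                break  # only one page exists, nothing to page through
        if forward:
            go_next()
        else:
            go_prev()
    return visits


class FakePager:
    """In-memory stand-in for the site (pages 1..last), so no browser is needed."""

    def __init__(self, last: int) -> None:
        self.page = 1
        self.last = last
        self.visited: List[int] = []

    def has_next(self) -> bool:
        return self.page < self.last

    def has_prev(self) -> bool:
        return self.page > 1

    def go_next(self) -> None:
        self.page += 1

    def go_prev(self) -> None:
        self.page -= 1

    def scrape(self) -> None:
        self.visited.append(self.page)


pager = FakePager(last=3)
ping_pong_pages(pager.has_next, pager.has_prev, pager.go_next,
                pager.go_prev, pager.scrape, max_visits=7)
print(pager.visited)  # [1, 2, 3, 2, 1, 2, 3]
```

With Selenium, has_next could be wired up as something like `lambda: bool(driver.find_elements_by_xpath("//a[@class='review-next']"))` and go_next as a click on that element, with scrape holding the link-visiting for loop. That wiring is an assumption based on the code above, not something tested against the real site.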
Upvotes: 1