Reputation: 137
I've written a script in Python, in combination with Selenium, to scrape the links of different posts from different pages while clicking on the next page button, and to get the title of each post from its inner page. Although the content I'm dealing with here is static, I used Selenium to see how it parses items while clicking through the next pages. I'm only after a solution that uses Selenium.
If I define an empty list and extend it with all the links, I can eventually parse all the titles by reusing those links once the clicking through the next pages is done, but that is not what I want.
What I intend to do instead is collect all the links from each page and parse the title of each post from its inner page while clicking on the next page button. In short, I wish to do the two things simultaneously.
I've tried with:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "https://stackoverflow.com/questions/tagged/web-scraping"

def get_links(url):
    driver.get(url)
    while True:
        items = [item.get_attribute("href") for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,".summary .question-hyperlink")))]
        yield from get_info(items)
        try:
            elem = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR,".pager > a[rel='next']")))
            driver.execute_script("arguments[0].scrollIntoView();",elem)
            elem.click()
            time.sleep(2)
        except Exception:
            break

def get_info(links):
    for link in links:
        driver.get(link)
        name = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "a.question-hyperlink"))).text
        yield name

if __name__ == '__main__':
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver,10)
    for item in get_links(link):
        print(item)
When I run the above script, it parses the titles of the posts using the links from the first page, but then breaks with this error:
raise TimeoutException(message, screen, stacktrace)
It happens when execution reaches this line:
elem = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR,".pager > a[rel='next']")))
How can I scrape the title of each post from its inner page, collecting the links from the first page, and then click on the next page button to repeat the process until it is done?
Upvotes: 0
Views: 98
Reputation: 33384
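You are not finding the next button because, once the loop has visited each inner link, the driver is sitting on the last question page, which has no pager. The wait for the next button therefore times out.
Instead, build the URL of each listing page yourself, as below, and load it directly.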
urlnext = 'https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page={}&pagesize=30'.format(pageno) #where page will start from 2
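As a small aside, the same paged URL can be assembled from its query parameters instead of a format string. This is only a sketch: the helper name `page_url` and its defaults are my own, and it assumes the `tab`, `page` and `pagesize` parameters shown in the URL above.

```python
from urllib.parse import urlencode

BASE = "https://stackoverflow.com/questions/tagged/web-scraping"

def page_url(pageno, pagesize=30, tab="newest"):
    """Build the listing URL for a given page number (hypothetical helper)."""
    # urlencode keeps the insertion order of the dict, so the query string
    # has the same shape as the hand-written URL.
    query = urlencode({"tab": tab, "page": pageno, "pagesize": pagesize})
    return f"{BASE}?{query}"

print(page_url(2))
```

Calling `driver.get(page_url(npage))` would then take the place of `driver.get(urlnext.format(npage))`.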
Try the code below.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "https://stackoverflow.com/questions/tagged/web-scraping"

def get_links(url):
    urlnext = 'https://stackoverflow.com/questions/tagged/web-scraping?tab=newest&page={}&pagesize=30'
    npage = 2
    driver.get(url)
    while True:
        items = [item.get_attribute("href") for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,".summary .question-hyperlink")))]
        yield from get_info(items)
        driver.get(urlnext.format(npage))
        try:
            elem = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR,".pager > a[rel='next']")))
            npage=npage+1
            time.sleep(2)
        except Exception:
            break

def get_info(links):
    for link in links:
        driver.get(link)
        name = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "a.question-hyperlink"))).text
        yield name

if __name__ == '__main__':
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver,10)
    for item in get_links(link):
        print(item)
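One thing to watch with the loop above: the final listing page has no next link, so the wait times out and the loop appears to stop before that page's posts are scraped. An alternative is to read the next page's URL while you are still on the listing page, before visiting any inner link, and to stop only when there is no such URL. Below is a minimal sketch of that control flow. To keep it runnable without a browser it uses an in-memory stand-in, so `crawl_titles`, `FakeBrowser` and `SITE` are all hypothetical names of mine and not part of Selenium.

```python
def crawl_titles(start_url, open_page, post_links, next_page_url, post_title):
    """Yield post titles page by page, interleaving listing and inner pages."""
    url = start_url
    while url:
        open_page(url)
        links = post_links()
        # Remember where the next listing page is *before* leaving this one;
        # after visiting inner pages the pager is no longer on screen.
        url = next_page_url()
        for link in links:
            open_page(link)
            yield post_title()

# In-memory stand-in for a site: two listing pages and three posts.
SITE = {
    "list/1": {"links": ["post/a", "post/b"], "next": "list/2"},
    "list/2": {"links": ["post/c"], "next": None},
    "post/a": {"title": "A"},
    "post/b": {"title": "B"},
    "post/c": {"title": "C"},
}

class FakeBrowser:
    """Tiny substitute for a webdriver, only for exercising the loop."""
    def __init__(self, site):
        self.site = site
        self.current = None

    def get(self, url):
        self.current = self.site[url]

    def links(self):
        return list(self.current.get("links", []))

    def next_url(self):
        return self.current.get("next")

    def title(self):
        return self.current["title"]

browser = FakeBrowser(SITE)
titles = list(crawl_titles("list/1", browser.get, browser.links,
                           browser.next_url, browser.title))
print(titles)  # ['A', 'B', 'C']
```

With a real driver you would pass `driver.get` as `open_page` and write the other three callables with the selectors from the question. For `next_page_url`, something like `driver.find_elements(By.CSS_SELECTOR, ".pager > a[rel='next']")` returns an empty list when the link is absent, so you can return the first element's `href` or `None` without waiting for a timeout. I have not run that wiring against the live site, so treat it as an assumption.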
Upvotes: 1