Reputation: 164
I'm trying to parse 131 product links by traversing all the next pages from a webpage. The next page button does contain the next page link, but forming a full-fledged URL out of it seems to be really hard.
I've tried so far with:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
base = 'https://www.phoenixcontact.com{}'
link = 'https://www.phoenixcontact.com/online/portal/gb?1dmy&urile=wcm%3apath%3a/gben/web/main/products/list_pages/DC_charging_cables_P-10-11-01-01/aa4065f9-ec6c-4765-b2c7-d3b31d247fc6'
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}
def get_links(link):
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.text, "lxml")
    for item in soup.select("[class='pxc-sales-data-wrp'][data-product-key] h3 > a[href][onclick]"):
        item_link = base.format(item.get("href"))
        yield item_link

    next_page = soup.select_one("[class='pxc-pager'] a[class='pxc-pager-next']")
    if next_page:
        next_page_link = urljoin(link, next_page.get("href"))
        yield from get_links(next_page_link)

if __name__ == '__main__':
    for elem in get_links(link):
        print(elem)
The above approach gets me the links from the first page over and over again instead of the links from the next pages.
How can I get the links from the next pages by traversing the next page button using requests?
Upvotes: 0
Views: 103
Reputation: 12672
You need to keep a session, otherwise you will stay on the first page.
You can get the base URL by finding the <base> tag (it is stored as <base href="..">). Try the code below:
import requests
from bs4 import BeautifulSoup
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}
link = 'https://www.phoenixcontact.com/online/portal/gb?1dmy&urile=wcm%3apath%3a/gben/web/main/products/list_pages/DC_charging_cables_P-10-11-01-01/aa4065f9-ec6c-4765-b2c7-d3b31d247fc6'
s = requests.Session()
s.headers.update(headers)
while True:
    response = s.get(link)
    soup = BeautifulSoup(response.text, "lxml")
    base_url = soup.select_one("base").get("href")
    next_page_element = soup.select_one(".pxc-pager-next")
    if next_page_element is not None:
        next_page_url = next_page_element.get("href")
        link = base_url + next_page_url
        print(link)
    else:
        break
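As a side note, the URL-joining step itself can be checked offline with the standard library's urljoin, independent of the site. The base and href values below are made up for illustration; they only mimic the shape of a <base href=".."> URL and a query-style pager href like the one on the page:

```python
from urllib.parse import urljoin

# Hypothetical values for illustration (not taken from the live site):
# the URL a <base href=".."> tag might carry, and a query-only "next" href.
base_href = "https://www.phoenixcontact.com/online/portal/gb/"
next_href = "?1dmy&page=2"

# urljoin resolves the relative href against the base URL, which is what
# the question's code attempted with urljoin(link, next_page.get("href")).
full_link = urljoin(base_href, next_href)
print(full_link)
```

This prints the base URL with the pager query appended, so the join logic in the question was already sound; the repeated first page came from not reusing a session, not from the URL construction.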
Upvotes: 2