MITHU

Reputation: 164

Can't form a working URL from the next-page button's link on a webpage

I'm trying to collect 131 product links by traversing all the next pages of a webpage. The next-page button does contain a link, but turning it into a full, working URL has proven hard.

webpage link

I've tried so far with:

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base = 'https://www.phoenixcontact.com{}'
link = 'https://www.phoenixcontact.com/online/portal/gb?1dmy&urile=wcm%3apath%3a/gben/web/main/products/list_pages/DC_charging_cables_P-10-11-01-01/aa4065f9-ec6c-4765-b2c7-d3b31d247fc6'

headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}

def get_links(link):
    r = requests.get(link, headers=headers)
    soup = BeautifulSoup(r.text, "lxml")
    # product links on the current page
    for item in soup.select("[class='pxc-sales-data-wrp'][data-product-key] h3 > a[href][onclick]"):
        item_link = base.format(item.get("href"))
        yield item_link

    # follow the next-page button, if there is one
    next_page = soup.select_one("[class='pxc-pager'] a[class='pxc-pager-next']")
    if next_page:
        next_page_link = urljoin(link, next_page.get("href"))
        yield from get_links(next_page_link)

if __name__ == '__main__':
    for elem in get_links(link):
        print(elem)

The above approach gets me the links from the first page over and over again instead of moving on to the next pages.

How can I collect the links from all the next pages, following the next-page button, using requests?

Upvotes: 0

Views: 103

Answers (1)

jizhihaoSAMA

Reputation: 12672

You need to keep a session; otherwise you will stay on the first page.

You can get the base URL from the <base> tag (it is stored as <base href="..">). Try the code below:

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}

link = 'https://www.phoenixcontact.com/online/portal/gb?1dmy&urile=wcm%3apath%3a/gben/web/main/products/list_pages/DC_charging_cables_P-10-11-01-01/aa4065f9-ec6c-4765-b2c7-d3b31d247fc6'

s = requests.Session()  # one Session keeps the cookies, so pagination actually advances
s.headers.update(headers)
while True:
    response = s.get(link)
    soup = BeautifulSoup(response.text, "lxml")
    # the page stores its URL prefix in <base href="...">
    base_url = soup.select_one("base").get("href")

    next_page_element = soup.select_one(".pxc-pager-next")
    if next_page_element is not None:
        next_page_url = next_page_element.get("href")
        link = base_url + next_page_url
        print(link)
    else:
        break
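Putting the two pieces together, one possible shape of the full crawler keeps a single session, resolves every relative href against the page's <base href> with urljoin, and yields the product links page by page. This is only a sketch: the CSS selectors are taken from the question and may break if the site's markup changes, and it has not been run against the live site.

```python
import inspect
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
                         "(KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}

def get_links(start_link):
    # One Session for the whole crawl, so cookies persist and
    # the next-page request really advances the listing.
    with requests.Session() as s:
        s.headers.update(headers)
        link = start_link
        while link:
            soup = BeautifulSoup(s.get(link).text, "lxml")
            # <base href="..."> holds the prefix for the page's relative hrefs
            base_url = soup.select_one("base").get("href")
            # product links on the current page (selector from the question)
            for item in soup.select(
                "[class='pxc-sales-data-wrp'][data-product-key] h3 > a[href]"
            ):
                yield urljoin(base_url, item.get("href"))
            # follow the next-page button until it disappears
            next_page = soup.select_one(".pxc-pager-next")
            link = urljoin(base_url, next_page.get("href")) if next_page else None
```

Using urljoin instead of string concatenation also protects you when the href is already absolute or starts with a slash.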

Upvotes: 2
