Kevin Vignola Ruder

Reputation: 31

Iterating through pages in Python using requests and beautifulsoup

I'm trying to extract links from a website. The site has more than one page of results, so I'm using a loop to iterate through the pages. The problem is that the content in soup and new_links is just duplicated on every iteration. The URL passed to requests.get changes between iterations, and I've double-checked that the content at each URL is different, which it is.

new_links remains the same on every iteration of the loop.

Can anyone please explain how I can fix this?

def get_links(root_url):

    list_of_links = []

    # how many pages should we scroll through? currently set to 2
    for i in range(1,3):
        r = requests.get(root_url+"&page={}.".format(i))
        soup = BeautifulSoup(r.content, 'html.parser')
        new_links = soup.find_all("li", {"class": "padding-all"})
        list_of_links.extend(new_links)

    print(list_of_links)

    return list_of_links

Upvotes: 2

Views: 7102

Answers (1)

Martin Evans

Reputation: 46779

You need to enumerate the links within each li you were looking for, rather than collecting the li elements themselves. It is probably best to add each link to a set() to remove duplicates; this can then be converted to a sorted list on return:

from bs4 import BeautifulSoup
import requests

def get_links(root_url):
    set_of_links = set()

    # how many pages should we scroll through? currently set to 2
    for i in range(1, 3):
        r = requests.get(root_url + "&page={}".format(i))
        soup = BeautifulSoup(r.content, 'html.parser')

        # collect the href from every link inside each matching li
        for li in soup.find_all("li", {"class": "padding-all"}):
            for a in li.find_all('a', href=True):
                set_of_links.add(a['href'])

    return sorted(set_of_links)

for index, link in enumerate(get_links("http://borsen.dk/soegning.html?query=iot"), start=1):
    print(index, link)

Giving you:

1 http://borsen.dk/nyheder/avisen/artikel/11/102926/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
2 http://borsen.dk/nyheder/avisen/artikel/11/111767/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
3 http://borsen.dk/nyheder/avisen/artikel/11/111771/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
4 http://borsen.dk/nyheder/avisen/artikel/11/111776/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
5 http://borsen.dk/nyheder/avisen/artikel/11/111789/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
6 http://borsen.dk/nyheder/avisen/artikel/11/114652/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
7 http://borsen.dk/nyheder/avisen/artikel/11/114677/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
8 http://borsen.dk/nyheder/avisen/artikel/11/117729/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
9 http://borsen.dk/nyheder/avisen/artikel/11/122984/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
10 http://borsen.dk/nyheder/avisen/artikel/11/124160/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
11 http://borsen.dk/nyheder/avisen/artikel/11/130267/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
12 http://borsen.dk/nyheder/avisen/artikel/11/130268/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
13 http://borsen.dk/nyheder/avisen/artikel/11/130272/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
14 http://borsen.dk/nyheder/avisen/artikel/11/130882/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
15 http://borsen.dk/nyheder/avisen/artikel/11/132641/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
16 http://borsen.dk/nyheder/avisen/artikel/11/145430/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
17 http://borsen.dk/nyheder/avisen/artikel/11/149967/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
18 http://borsen.dk/nyheder/avisen/artikel/11/151618/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
19 http://borsen.dk/nyheder/avisen/artikel/11/158183/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
20 http://borsen.dk/nyheder/avisen/artikel/11/158769/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
21 http://borsen.dk/nyheder/avisen/artikel/11/44962/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
22 http://borsen.dk/nyheder/avisen/artikel/11/93884/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
23 http://borsen.dk/nyheder/avisen/artikel/11/93890/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
24 http://borsen.dk/nyheder/avisen/artikel/11/93896/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
25 http://borsen.dk/nyheder/executive/artikel/11/161556/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
26 http://borsen.dk/nyheder/virksomheder/artikel/1/315489/rapport_digitale_tiltag_kan_transformere_danske_selskaber.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
27 http://borsen.dk/nyheder/virksomheder/artikel/1/337498/danske_virksomheder_overser_den_digitale_revolution.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
28 http://borsen.dk/opinion/blogs/view/17/3614/tingenes_internet__hvornaar_bliver_det_til_virkelighed.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
29 http://borsen.dk/opinion/blogs/view/17/4235/digitalisering_og_nye_forretningsmodeller.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
30 http://ledelse.borsen.dk/artikel/1/323424/burde_digitalisering_vaere_hoejere_paa_listen_over_foretrukne_ledelsesvaerktoejer.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
31 http://pleasure.borsen.dk/gadget/artikel/1/305849/digital_butler_styrer_din_kommende_bolig.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
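
As an aside, requests can also build the page query parameter for you instead of formatting it into the URL by hand. Below is a minimal sketch of the same loop using the params argument (assuming the site accepts page as an ordinary query parameter, as the URLs above suggest):

from bs4 import BeautifulSoup
import requests

def get_links(root_url):
    set_of_links = set()

    for i in range(1, 3):
        # requests merges params into the existing query string,
        # e.g. ...?query=iot becomes ...?query=iot&page=1
        r = requests.get(root_url, params={"page": i})
        soup = BeautifulSoup(r.content, 'html.parser')

        for li in soup.find_all("li", {"class": "padding-all"}):
            for a in li.find_all('a', href=True):
                set_of_links.add(a['href'])

    return sorted(set_of_links)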

It would probably also make more sense to follow the link in the next-page button rather than guess how many pages to iterate over, for example:

from bs4 import BeautifulSoup
import requests

def get_links(root_url):
    links = []

    while True:
        print(root_url)  # show which page is currently being fetched
        r = requests.get(root_url)
        soup = BeautifulSoup(r.content, 'html.parser')

        # take only the first link from each matching li
        for li in soup.find_all("li", {"class": "padding-all"}):
            for a in li.find_all('a', href=True)[:1]:
                links.append(a['href'])

        # follow the next-page link until there isn't one
        next_page = soup.find("div", {"class": "next-container"})

        if next_page:
            next_url = next_page.find("a", href=True)

            if next_url:
                root_url = next_url['href']
            else:
                break
        else:
            break

    return links
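
This version can be called the same way; for example, reusing the search URL from above:

for index, link in enumerate(get_links("http://borsen.dk/soegning.html?query=iot"), start=1):
    print(index, link)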

Upvotes: 2
