Reputation: 31
I'm trying to extract links from a website. The site has more than a single page of results, so I'm using a loop to iterate through the different pages. The issue, however, is that the content in soup and new_links is just duplicated. The URL passed to requests.get changes on each iteration, and I've double-checked that the content at each URL is different, and it is.
new_links remains the same regardless of the loop iteration.
Can anyone please explain how I can fix this?
def get_links(root_url):
    list_of_links = []
    # how many pages should we scroll through? currently set to 2
    for i in range(1, 3):
        r = requests.get(root_url + "&page={}.".format(i))
        soup = BeautifulSoup(r.content, 'html.parser')
        new_links = soup.find_all("li", {"class": "padding-all"})
        list_of_links.extend(new_links)
        print(list_of_links)
    return list_of_links
Upvotes: 2
Views: 7102
Reputation: 46779
You need to enumerate the links within the li elements you were looking for. It is probably best to add each one to a set() to remove duplicates; this can then be converted to a sorted list on return:
from bs4 import BeautifulSoup
import requests

def get_links(root_url):
    set_of_links = set()
    # how many pages should we scroll through? currently set to 2
    for i in range(1, 3):
        r = requests.get(root_url + "&page={}".format(i))
        soup = BeautifulSoup(r.content, 'html.parser')
        for li in soup.find_all("li", {"class": "padding-all"}):
            for a in li.find_all('a', href=True):
                set_of_links.add(a['href'])
    return sorted(set_of_links)

for index, link in enumerate(get_links("http://borsen.dk/soegning.html?query=iot"), start=1):
    print(index, link)
Giving you:
1 http://borsen.dk/nyheder/avisen/artikel/11/102926/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
2 http://borsen.dk/nyheder/avisen/artikel/11/111767/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
3 http://borsen.dk/nyheder/avisen/artikel/11/111771/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
4 http://borsen.dk/nyheder/avisen/artikel/11/111776/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
5 http://borsen.dk/nyheder/avisen/artikel/11/111789/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
6 http://borsen.dk/nyheder/avisen/artikel/11/114652/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
7 http://borsen.dk/nyheder/avisen/artikel/11/114677/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
8 http://borsen.dk/nyheder/avisen/artikel/11/117729/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
9 http://borsen.dk/nyheder/avisen/artikel/11/122984/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
10 http://borsen.dk/nyheder/avisen/artikel/11/124160/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
11 http://borsen.dk/nyheder/avisen/artikel/11/130267/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
12 http://borsen.dk/nyheder/avisen/artikel/11/130268/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
13 http://borsen.dk/nyheder/avisen/artikel/11/130272/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
14 http://borsen.dk/nyheder/avisen/artikel/11/130882/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
15 http://borsen.dk/nyheder/avisen/artikel/11/132641/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
16 http://borsen.dk/nyheder/avisen/artikel/11/145430/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
17 http://borsen.dk/nyheder/avisen/artikel/11/149967/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
18 http://borsen.dk/nyheder/avisen/artikel/11/151618/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
19 http://borsen.dk/nyheder/avisen/artikel/11/158183/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
20 http://borsen.dk/nyheder/avisen/artikel/11/158769/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
21 http://borsen.dk/nyheder/avisen/artikel/11/44962/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
22 http://borsen.dk/nyheder/avisen/artikel/11/93884/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
23 http://borsen.dk/nyheder/avisen/artikel/11/93890/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
24 http://borsen.dk/nyheder/avisen/artikel/11/93896/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6ODtzOjM6IklPVCI7fQ,,
25 http://borsen.dk/nyheder/executive/artikel/11/161556/artikel.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
26 http://borsen.dk/nyheder/virksomheder/artikel/1/315489/rapport_digitale_tiltag_kan_transformere_danske_selskaber.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
27 http://borsen.dk/nyheder/virksomheder/artikel/1/337498/danske_virksomheder_overser_den_digitale_revolution.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
28 http://borsen.dk/opinion/blogs/view/17/3614/tingenes_internet__hvornaar_bliver_det_til_virkelighed.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
29 http://borsen.dk/opinion/blogs/view/17/4235/digitalisering_og_nye_forretningsmodeller.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
30 http://ledelse.borsen.dk/artikel/1/323424/burde_digitalisering_vaere_hoejere_paa_listen_over_foretrukne_ledelsesvaerktoejer.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
31 http://pleasure.borsen.dk/gadget/artikel/1/305849/digital_butler_styrer_din_kommende_bolig.html?hl=YToyOntpOjA7czozOiJJb1QiO2k6NjtzOjM6IklPVCI7fQ,,
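As an aside, the reason the set() matters can be seen in a minimal standalone sketch (with made-up link data, not taken from the site above): extending a list keeps every copy of a link that appears on more than one page, while updating a set drops the repeats.

# Minimal sketch with hypothetical link data: list.extend keeps
# repeats across pages, set.update removes them.
page1 = ["/article/1", "/article/2"]
page2 = ["/article/2", "/article/3"]  # "/article/2" appears on both pages

as_list = []
as_list.extend(page1)
as_list.extend(page2)
print(as_list)          # ['/article/1', '/article/2', '/article/2', '/article/3']

as_set = set()
as_set.update(page1)
as_set.update(page2)
print(sorted(as_set))   # ['/article/1', '/article/2', '/article/3']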
It would probably make more sense to search for the link in the next-page button rather than guessing how many pages to iterate over, for example:
from bs4 import BeautifulSoup
import requests

def get_links(root_url):
    links = []
    while True:
        print(root_url)
        r = requests.get(root_url)
        soup = BeautifulSoup(r.content, 'html.parser')
        # take only the first link in each matching li
        for li in soup.find_all("li", {"class": "padding-all"}):
            for a in li.find_all('a', href=True)[:1]:
                links.append(a['href'])
        # follow the next-page button until there isn't one
        next_page = soup.find("div", {"class": "next-container"})
        if next_page:
            next_url = next_page.find("a", href=True)
            if next_url:
                root_url = next_url['href']
            else:
                break
        else:
            break
    return links
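This version can be driven the same way as the first one (assuming the same starting search URL):

for index, link in enumerate(get_links("http://borsen.dk/soegning.html?query=iot"), start=1):
    print(index, link)

Note that it returns the links in page order as a plain list; if you still want duplicates removed, wrap the result in sorted(set(...)) as before.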
Upvotes: 2