Bimons

Reputation: 241

Finding links with BeautifulSoup in Python

I am having a hard time extracting the hyperlinks from a page with BeautifulSoup. I have tried many different tags and classes, but I can't seem to get the links without a whole bunch of other HTML I don't want. Can anyone tell me where I'm going wrong? Code below:

from bs4 import BeautifulSoup
import requests

page_link = url

page_response = requests.get(page_link, timeout=5)

soup = BeautifulSoup(page_response.content, "html.parser")

pagecode = soup.find(class_='infinite-scroll-container')

title = pagecode.findAll('i')
artist = pagecode.find_all('h1', "exhibition-title")
links = pagecode.find_all('article', "teaser infinite-scroll-item")


printcount=0
while printcount < len(title):  
    titlestring = title[printcount].text  
    artiststring = artist[printcount].text
    artiststring = artiststring.replace(titlestring, '')
    artiststring = artiststring.strip()
    titlestring = titlestring.strip()
    print(artiststring)
    print(titlestring)
    print("----------------------------------------")
    printcount = printcount+1

Upvotes: 1

Views: 118

Answers (1)

Bitto

Reputation: 8205

You could directly target all the links on that page and then filter them to keep only the links within an article. Note that this page is fully loaded only on scroll, so you may have to use Selenium to get all the links. For now, I will answer how to filter the links.

from bs4 import BeautifulSoup
import requests
import re
page_link = 'https://hopkinsonmossman.com/exhibitions/past/'
page_response = requests.get(page_link, timeout=5)
soup = BeautifulSoup(page_response.content, "html.parser")
links = soup.find_all('a')
for link in links:
    if link.parent.name == 'article':  # only links whose direct parent is an <article>
        print(re.sub(r"\s\s+", " ", link.text).strip())  # collapse runs of 2+ whitespace characters into one space
        print(link['href'])
        print()

Output

Nicola Farquhar A Holotype Heart 22 Nov – 21 Dec 2018 Wellington
https://hopkinsonmossman.com/exhibitions/nicola-farquhar-5/

Bill Culbert Desk Lamp, Crash 19 Oct – 17 Nov 2018 Wellington
https://hopkinsonmossman.com/exhibitions/bill-culbert-2/

Nick Austin, Ammon Ngakuru Many Happy Returns 18 Oct – 15 Nov 2018 Auckland
https://hopkinsonmossman.com/exhibitions/nick-austin-ammon-ngakuru/

Dane Mitchell Tuning 13 Sep – 13 Oct 2018 Wellington
https://hopkinsonmossman.com/exhibitions/dane-mitchell-4/

Shannon Te Ao my life as a tunnel 08 Sep – 13 Oct 2018 Auckland
https://hopkinsonmossman.com/exhibitions/shannon-te-ao/

Tilt Anoushka Akel, Ruth Buchanan, Meg Porteous 16 Aug – 08 Sep 2018 Wellington
https://hopkinsonmossman.com/exhibitions/anoushka-akel-ruth-buchanan-meg-porteous/

Shadow Work Fiona Connor, Oliver Perkins 02 Aug – 01 Sep 2018 Auckland
https://hopkinsonmossman.com/exhibitions/group-show/

Emma McIntyre Rose on red 13 Jul – 11 Aug 2018 Wellington
https://hopkinsonmossman.com/exhibitions/emma-mcintyre-2/

Tahi Moore Incomprehensible public fictions: Writer fights politician in car park 04 Jul – 28 Jul 2018 Auckland
https://hopkinsonmossman.com/exhibitions/tahi-moore-2/

Oliver Perkins Bleeding Edge 01 Jun – 07 Jul 2018 Wellington
https://hopkinsonmossman.com/exhibitions/oliver-perkins-2/

Spinning Phillip Lai, Peter Robinson 19 May – 23 Jun 2018 Auckland
https://hopkinsonmossman.com/exhibitions/1437/

Milli Jannides Cavewoman 19 Apr – 26 May 2018 Wellington
https://hopkinsonmossman.com/exhibitions/milli-jannides/

Oscar Enberg Taste & Power, a prologue 06 Apr – 12 May 2018 Auckland
https://hopkinsonmossman.com/exhibitions/oscar-enberg/

Fiona Connor Closed Down Clubs & Monochromes 09 Mar – 14 Apr 2018 Wellington
https://hopkinsonmossman.com/exhibitions/closed-down-clubs-and-monochromes/

Bill Culbert Colour Theory, Window Mobile 02 Mar – 29 Mar 2018 Auckland
https://hopkinsonmossman.com/exhibitions/colour-theory-window-mobile/

Role Models Curated by Rob McKenzie
Robert Bittenbender, Ellen Cantor, Jennifer McCamley, Josef Strau 26 Jan – 24 Feb 2018 Auckland
https://hopkinsonmossman.com/exhibitions/role-models/

Emma McIntyre Pink Square Sways 24 Nov – 23 Dec 2017 Auckland
https://hopkinsonmossman.com/exhibitions/emma-mcintyre/
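A note on the whitespace cleanup in the loop above: the pattern `\s\s+` only matches runs of two or more whitespace characters, so a single newline inside the link text is left alone. That is probably why the "Role Models" entry in the output spans two lines. The sketch below compares that pattern with `" ".join(text.split())`, which collapses every run of whitespace, including single newlines. The helper names and the sample string are made up for illustration.

```python
import re


def collapse_runs(text):
    """Same cleanup as in the answer: only runs of 2+ whitespace characters are replaced."""
    return re.sub(r"\s\s+", " ", text).strip()


def collapse_all(text):
    """Alternative: split on any whitespace and rejoin with single spaces."""
    return " ".join(text.split())


# Hypothetical link text with a lone newline in the middle.
sample = "  Role Models   Curated by Rob McKenzie\nRobert Bittenbender    26 Jan  "

print(repr(collapse_runs(sample)))  # the single newline survives
print(repr(collapse_all(sample)))   # everything ends up on one line
```

If you want each exhibition on exactly one line, the second helper is the safer choice.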

My initial thought was to use the "ajax-link" class, but it turns out the 'HOPKINSON MOSSMAN' link also has that class. You could still use that approach and skip the first link returned by find_all, which gives the same result.

from bs4 import BeautifulSoup
import requests
import re
page_link = 'https://hopkinsonmossman.com/exhibitions/past/'
page_response = requests.get(page_link, timeout=5)
soup = BeautifulSoup(page_response.content, "html.parser")
links = soup.find_all('a', class_='ajax-link')
for link in links[1:]:  # skip the first link ('HOPKINSON MOSSMAN')
    print(re.sub(r"\s\s+", " ", link.text).strip())  # collapse runs of 2+ whitespace characters into one space
    print(link['href'])
    print()
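Both filters above can also be written as a single CSS selector with `select`. The sketch below runs on a small inline HTML snippet so it works offline. The snippet is an assumption: it reuses the class names mentioned in this thread, but the real page's markup may differ, and the `article_links` helper and the example URLs are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the class names in the question.
SAMPLE_HTML = """
<header><a class="ajax-link" href="/">HOPKINSON MOSSMAN</a></header>
<div class="infinite-scroll-container">
  <article class="teaser infinite-scroll-item">
    <a class="ajax-link" href="/exhibitions/example-one/">
      <h1 class="exhibition-title">Artist One <i>Title One</i></h1>
    </a>
  </article>
  <article class="teaser infinite-scroll-item">
    <a class="ajax-link" href="/exhibitions/example-two/">
      <h1 class="exhibition-title">Artist Two <i>Title Two</i></h1>
    </a>
  </article>
</div>
"""


def article_links(html):
    """Return (text, href) pairs for <a> tags that are direct children of an <article>."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # "article > a[href]" keeps only anchors with an href whose parent is an <article>,
    # so the header link is excluded without slicing the result list.
    for link in soup.select("article > a[href]"):
        text = " ".join(link.get_text(" ").split())  # normalise all whitespace
        results.append((text, link["href"]))
    return results


for text, href in article_links(SAMPLE_HTML):
    print(text)
    print(href)
    print()
```

The selector avoids relying on the header link always being first in the list, which `links[1:]` quietly assumes.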

Upvotes: 2
