Reputation: 39
I am trying to select links from a list of 2000+ items. Ultimately I want to follow each link in the list and open the page it points to. I can get Beautiful Soup to print the li list I want, but I can't figure out how to follow the links. At the end of the code below, I tried adding this:
for link in RHAZ:
    print(link.get('href'))
but I get this error:
AttributeError: 'NavigableString' object has no attribute 'get'
I think this has to do with the HTML still being attached to the code (i.e., a, li, and HREF tags show in the code when I print the li). How do I get it to follow the links?
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
# The website I am starting at
my_url = 'https://mars.nasa.gov/msl/multimedia/raw/'
#calls the urlopen function from the request module of the urllib module
#AKA opens up the connection and grabs the page
uClient = uReq(my_url)
#reads the raw HTML of the page into Python
page_html = uClient.read()
#closes the client
uClient.close()
#parses the HTML using bs4
page_soup = soup(page_html, "lxml")
#finds the categories for the types of images on the site, category 1 is
#RHAZ
containers = page_soup.findAll("div", {"class": "image_list"})
RHAZ = containers[1]
# prints the li list that has the links I want
for child in RHAZ:
    print(child)
Upvotes: 1
Views: 2018
Reputation: 1402
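Iterating over RHAZ yields its direct children, and those include NavigableString objects (the whitespace and text between tags) as well as Tag objects such as ul. A NavigableString has no get method, and that is why you get the error.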
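To see where the NavigableString objects come from, here is a minimal sketch. The markup is made up to stand in for one image_list div (the real page will differ), and it uses the built-in "html.parser" rather than lxml only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup standing in for one "image_list" div on the page.
html = """<div class="image_list">
<ul>
<li><a href="/msl/multimedia/raw/?s=1">Sol 1</a></li>
<li><a href="/msl/multimedia/raw/?s=2">Sol 2</a></li>
</ul>
</div>"""

container = BeautifulSoup(html, "html.parser").find("div", {"class": "image_list"})

# Iterating over a Tag yields its direct children: text nodes as well as tags.
children = list(container)
kinds = [type(child).__name__ for child in children]
print(kinds)

# Keep only the Tag children; these are the ones that support .get() and .find_all().
tag_children = [child for child in children if isinstance(child, Tag)]
print([child.name for child in tag_children])
```

The newlines between the tags show up as NavigableString children, so calling .get('href') on every child fails on the first one.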
If you want the href of every anchor tag, find all the anchor tags inside RHAZ and read href from each one, as shown below.
for link in RHAZ.findAll('a'):
    print(link['href'])
    print(link['href'], link.text) # if you need both href and text
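To actually follow the links, the href values usually need to be turned into absolute URLs first. The following is only a sketch under assumptions: the sample markup and its href values are invented, extract_links and fetch_pages are hypothetical helper names, and "html.parser" is used in place of lxml. urljoin, urlopen and find_all(..., href=True) are standard calls.

```python
from urllib.parse import urljoin
from urllib.request import urlopen
from bs4 import BeautifulSoup

def extract_links(html, base_url):
    """Return an absolute URL for every <a> that has an href (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips anchors without an href, so a["href"] cannot raise KeyError.
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

def fetch_pages(urls, limit=3):
    """Open up to `limit` URLs and return a dict of url -> parsed page (hypothetical helper)."""
    pages = {}
    for url in urls[:limit]:
        with urlopen(url) as response:
            pages[url] = BeautifulSoup(response.read(), "html.parser")
    return pages

# Invented sample markup; the hrefs on the real page may look different.
sample_html = """<div class="image_list"><ul>
<li><a href="/msl/multimedia/raw/?s=1">Sol 1</a></li>
<li><a href="?s=2">Sol 2</a></li>
<li><a name="top">No link here</a></li>
</ul></div>"""

base = "https://mars.nasa.gov/msl/multimedia/raw/"
links = extract_links(sample_html, base)
print(links)
```

With the real page you would pass str(RHAZ) or page_html and my_url to extract_links, then call fetch_pages(links) to open the pages. With 2000+ links it is worth keeping the limit small while testing.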
P.S.: Instead of stating the error first and explaining your scenario afterwards, explain the scenario you are handling and then show the error you are facing. That will be clearer, and you will get a useful response more easily.
Upvotes: 2