Reputation: 35
With BeautifulSoup how would one get the links from a webpage, store them in a list, then print out a certain one? This is what I have so far:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://example.com/")
content = BeautifulSoup(html.read(), "html.parser")
for link in content.find_all("a"):
    print(link.get("href")[0])
But I get this error:
TypeError: 'NoneType' object is not subscriptable
How can I solve this problem and get the first link?
Upvotes: 1
Views: 44
Reputation: 244
To retrieve the links from a page, you can filter the a tags with a regex on their href attribute.
The following code should do it for you:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen("https://www.stmaryottumwa.org/")
content = BeautifulSoup(html.read(), "html.parser")
links = []
for link in content.find_all("a", attrs={'href': re.compile("^http")}):
    links.append(link.get("href"))
print(links[0]) # print first link on page
The variable links will contain every link on the page whose href starts with http; relative links and a tags without an href are skipped.
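To see what the ^http filter keeps and what it drops, here is a minimal sketch that parses a small inline HTML string instead of fetching a live page. The SAMPLE_HTML markup and its URLs are made up for illustration; only the find_all call mirrors the code above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, used here instead of a live page.
SAMPLE_HTML = """
<a href="https://example.com/a">absolute link</a>
<a href="/about">relative link</a>
<a name="top">anchor without href</a>
<a href="http://example.org/b">another absolute link</a>
"""

content = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Keep only <a> tags whose href starts with "http".
# Tags with no href at all never match the pattern, so they are skipped too.
links = [a["href"] for a in content.find_all("a", href=re.compile("^http"))]

print(links)
```

The relative link /about and the anchor without an href do not appear in the result, so this approach is only suitable if absolute URLs are all you need.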
Upvotes: 2
Reputation: 81594
The TypeError comes from a tags that have no href attribute: for those, link.get("href") returns None, and None cannot be indexed with [0].
An element's attributes are stored in its .attrs dict. Using .get on that dict returns None instead of raising an error when the attribute is missing:
link.attrs.get('href')
I'm not sure what you expected [0] to do, since an a tag can only have a single href attribute. Using [0] will get you the first character of the href attribute.
Check for a missing href before using the value:
for link in content.find_all("a"):
    href = link.attrs.get('href')  # None when the tag has no href
    if href:
        print(href[0])  # first character only; use print(href) for the whole link
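Putting this together for the original goal (store the links in a list, then print a particular one), a minimal sketch could look like the following. It assumes an inline SAMPLE_HTML string rather than a real page, and the markup is invented for the example; with a live page you would pass html.read() to BeautifulSoup as in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<a name="top">anchor without href</a>
<a href="/about">About</a>
<a href="https://example.com/contact">Contact</a>
"""

content = BeautifulSoup(SAMPLE_HTML, "html.parser")

links = []
for link in content.find_all("a"):
    href = link.get("href")  # None when the tag has no href
    if href:                 # skip missing or empty hrefs
        links.append(href)

# Index the list, not the individual href string, to pick a link.
if links:
    print(links[0])
```

Here the [0] is applied to the list, so it selects the first link rather than the first character of a link. The if links guard avoids an IndexError on pages with no usable links.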
Upvotes: 2