Reputation: 9
I'm trying to scrape a website. I learned to scrape from two resources: one used tag.get('href') to get the href from an a tag, and the other used tag['href'] to do the same. As far as I understand, both do the same thing. But when I tried this code:
link_list = [l.get('href') for l in soup.find_all('a')]
it worked with the .get method, but not with dictionary-style access:
link_list = [l['href'] for l in soup.find_all('a')]
This throws a KeyError. I'm very new to scraping, so please pardon me if this is a silly question.
Edit: both approaches worked when I used find instead of find_all.
Upvotes: 1
Views: 8791
Reputation: 473763
You can have BeautifulSoup find only the links that actually have an href attribute.
You can do it in two common ways. With find_all():
link_list = [a['href'] for a in soup.find_all('a', href=True)]
Or, with a CSS selector:
link_list = [a['href'] for a in soup.select('a[href]')]
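To see how the two filters compare with plain .get(), here is a minimal sketch. The sample markup and variable names are made up for illustration; the second a tag deliberately has no href attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the middle <a> has no href attribute.
html = '<a href="/home">Home</a><a name="top">Anchor</a><a href="/about">About</a>'
soup = BeautifulSoup(html, "html.parser")

# .get() never raises; a missing attribute comes back as None.
with_get = [a.get("href") for a in soup.find_all("a")]

# Filtering first makes dictionary-style access safe.
with_filter = [a["href"] for a in soup.find_all("a", href=True)]
with_select = [a["href"] for a in soup.select("a[href]")]

print(with_get)
print(with_filter)
print(with_select)
```

With .get() the list keeps a None entry for the tag without an href, while both filtered versions skip that tag entirely.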
Upvotes: 5
Reputation: 1587
Maybe the HTML contains an a tag that does not have an href attribute? For example:
from bs4 import BeautifulSoup
doc_html = """<a class="vote-up-off" title="This question shows research effort; it is useful and clear">up vote</a>"""
soup = BeautifulSoup(doc_html, 'html.parser')
ahref = soup.find('a')
ahref.get('href')
This returns None, so nothing is printed, but dictionary-style access raises a KeyError:
ahref['href']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sergey/.virtualenvs/soup_example/lib/python3.5/site-packages/bs4/element.py", line 1011, in __getitem__
    return self.attrs[key]
KeyError: 'href'
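The traceback shows that tag['href'] ends in self.attrs[key], which is an ordinary dictionary lookup. A rough sketch using a plain dict as a stand-in for a tag's attributes (the attrs variable here is illustrative, not a real Tag object) shows why the two forms behave differently:

```python
# A plain dict standing in for the attributes of an <a> tag with no href.
attrs = {"class": ["vote-up-off"], "title": "up vote"}

# dict.get returns None (or a default you supply) when the key is missing.
missing = attrs.get("href")
fallback = attrs.get("href", "#")

# Square-bracket access raises KeyError for a missing key.
try:
    attrs["href"]
    raised = False
except KeyError:
    raised = True

print(missing, fallback, raised)
```

So with find_all, a single a tag without an href is enough to make the list comprehension with tag['href'] fail, while the .get version just yields None for that tag.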
Upvotes: 0