Reputation: 15
I am working on a web-scraping tool that downloads articles to .txt files. I have created the soup with bs4 and pulled out the specific piece of HTML that contains the URL of the article I want to download:
>>> prevLink = soup2.select('.previous_post')
>>> prevLink
[<span class="previous_post">Previous Post: <a href="http://www.mrmoneymustache.com/2018/11/08/honey-badger-entrepreneur/" rel="prev">An Interview With The Man Who Never Needed a Real Job</a></span>]
So far so good (I think). Then I try to use .get('href') to pull out the link, but it returns None.
>>> print(prevLink[0].get('href'))
None
When I use .get('class') to select for the class, however, it seems to work.
>>> print(prevLink[0].get('class'))
['previous_post']
I don't understand why .get('class') is acting differently than .get('href'). Thanks for looking.
Upvotes: 1
Views: 161
Reputation: 474221
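prevLink[0] is not actually a link, but the span element that wraps it. The href attribute belongs to the a element inside the span, so .get('href') on the span returns None, while .get('class') works because class is set on the span itself.
Just go one level deeper to the a element with your selector: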
prevLink = soup2.select_one('.previous_post > a')
print(prevLink.get('href'))
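To see the difference without fetching the page, here is a minimal, self-contained sketch that parses the span from the question as a string. The variable names and the helper extract_prev_href are hypothetical, and it assumes the built-in html.parser backend; it is an illustration, not the only way to do this.

```python
from bs4 import BeautifulSoup

# The HTML fragment from the question, inlined so the example runs offline.
html = (
    '<span class="previous_post">Previous Post: '
    '<a href="http://www.mrmoneymustache.com/2018/11/08/honey-badger-entrepreneur/" '
    'rel="prev">An Interview With The Man Who Never Needed a Real Job</a></span>'
)

soup = BeautifulSoup(html, "html.parser")

# Selecting by class matches the span, which has a class but no href.
span = soup.select_one(".previous_post")
span_href = span.get("href")    # attribute is absent on the span
span_class = span.get("class")  # class is a multi-valued attribute, so a list


def extract_prev_href(soup):
    """Hypothetical helper: return the href of the link inside the span, or None."""
    link = soup.select_one(".previous_post > a")
    # select_one returns None when nothing matches, so guard before calling .get
    return link.get("href") if link is not None else None


href = extract_prev_href(soup)
print(span_href, span_class, href)
```

The None check matters on pages that have no previous post: select_one returns None there, and calling .get on it would raise an AttributeError.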
Upvotes: 1