Reputation: 1475
I don't understand why I get this error:
I have a fairly simple function:
def scrape_a(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    news = soup.find_all("div", attrs={"class": "news"})
    for links in news:
        link = news.find_all("href")
        return link
Here is the structure of the webpage I am trying to scrape:
<div class="news">
    <a href="www.link.com">
        <h2 class="heading">
            heading
        </h2>
        <div class="teaserImg">
            <img alt="" border="0" height="124" src="/image">
        </div>
        <p> text </p>
    </a>
</div>
Upvotes: 1
Views: 8619
Reputation: 1122232
You are doing two things wrong:
1. You are calling find_all on the news result set; presumably you meant to call it on the links object, one element in that result set.
2. There are no <href ...> tags in your document, so searching with find_all('href') is not going to get you anything. You only have tags with an href attribute.
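Both points can be checked without any network access. The following is a minimal sketch, assuming a trimmed copy of the HTML from the question held in a string and the built-in "html.parser" backend; the variable names HTML, result_set_error, by_tag_name and by_attribute are only illustrative:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumption: inline string,
# not fetched from a real page).
HTML = """
<div class="news">
  <a href="www.link.com">
    <h2 class="heading">heading</h2>
    <p> text </p>
  </a>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
news = soup.find_all("div", attrs={"class": "news"})

# Point 1: find_all() belongs to a single tag, not to the result set,
# so calling it on the whole set raises AttributeError.
try:
    news.find_all("href")
    result_set_error = None
except AttributeError as exc:
    result_set_error = exc

# Point 2: "href" is interpreted as a tag name; no <href> tag exists,
# so the search on one element comes back empty.
by_tag_name = news[0].find_all("href")

# href=True matches any tag that carries an href attribute.
by_attribute = news[0].find_all(href=True)
```

Here by_attribute holds the single <a> tag, while by_tag_name stays empty.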
You could correct your code to:
def scrape_a(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    news = soup.find_all("div", attrs={"class": "news"})
    for links in news:
        link = links.find_all(href=True)
        return link
to do what I think you tried to do.
I'd use a CSS selector:
def scrape_a(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    news_links = soup.select("div.news [href]")
    if news_links:
        return news_links[0]
If you wanted to return the value of the href attribute (the link itself), you need to extract that too, of course:
return news_links[0]['href']
If you needed all the link objects, and not the first, simply return news_links for the link objects, or use a list comprehension to extract the URLs:
return [link['href'] for link in news_links]
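Put together, the selector approach might look like the sketch below. It is not the only way to write it: extract_news_links is a hypothetical helper that takes the HTML as a string instead of a URL so it can run offline, and the explicit "html.parser" argument is an assumption added here to pin the parser.

```python
from bs4 import BeautifulSoup


def extract_news_links(html):
    """Return the href value of every tag with an href inside div.news."""
    soup = BeautifulSoup(html, "html.parser")
    # "div.news [href]" selects any descendant of <div class="news">
    # that has an href attribute.
    news_links = soup.select("div.news [href]")
    return [link["href"] for link in news_links]


# Sample markup modelled on the question (assumption: inline string).
SAMPLE = """
<div class="news">
  <a href="www.link.com"><h2 class="heading">heading</h2></a>
</div>
<div class="other">
  <a href="www.ignored.com">not news</a>
</div>
"""

links = extract_news_links(SAMPLE)
```

With a live page, the same helper could be fed requests.get(url).content. An empty list comes back when nothing matches, so callers do not need a separate None check.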
Upvotes: 5