Jolijt Tamanaha

Reputation: 333

Using Beautiful Soup 4 to scrape URLs within a <p class="postbody"> tag and save them to a text file

I realize this is probably incredibly straightforward, but please bear with me. I'm trying to use BeautifulSoup 4 to scrape the URLs of blog posts from a website that lists them. The <a> tag that I want is within a <p class="postbody"> tag. There are multiple <p class="postbody"> tags, each containing a header and then the link that I want to capture. This is the code I'm working with:

with io.open('TPNurls.txt', 'a', encoding='utf8') as logfile:
   snippet = soup.find_all('p', class="postbody")
   for link in snippet.find('a'):
       fulllink = link.get('href')
       logfile.write(fulllink + "\n")

The error I'm getting is:

AttributeError: 'ResultSet' object has no attribute 'find'

I understand that means snippet is a set and BeautifulSoup doesn't let me look for tags within a set. But then how can I do this? I need it to find the entire set of <p> tags, look for the <a> tag within each one, and then save each URL on a separate line in a file.

Upvotes: 2

Views: 552

Answers (2)

salmanwahed

Reputation: 9657

In your code,

snippet = soup.find_all('p', class="postbody")
for link in snippet.find('a'):

Here snippet is a bs4.element.ResultSet object, which has no find method, so you get this error. The elements of the ResultSet, however, are bs4.element.Tag objects, and those do support find.
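To make the type difference concrete, here is a minimal sketch. The HTML string and the example.com URLs are made up for illustration and only stand in for the real blog page.

```python
from bs4 import BeautifulSoup
from bs4.element import ResultSet, Tag

# Hypothetical markup standing in for the blog index page.
html = """
<p class="postbody"><b>First post</b> <a href="http://example.com/first">read</a></p>
<p class="postbody"><b>Second post</b> <a href="http://example.com/second">read</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.find_all("p", class_="postbody")

# find_all() returns a ResultSet, a list subclass that has no find() method.
is_result_set = isinstance(paragraphs, ResultSet)
has_find = hasattr(paragraphs, "find")

# Each element of the ResultSet is a Tag, and a Tag does provide find().
first = paragraphs[0]
first_is_tag = isinstance(first, Tag)
first_href = first.find("a")["href"]

print(is_result_set, has_find, first_is_tag, first_href)
```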

Change your code like this:

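# find_all returns a ResultSet; iterate over the Tag objects inside it
snippet = soup.find_all("p", {"class": "postbody"})
for paragraph in snippet:
    anchor = paragraph.find('a')
    if anchor is not None:  # skip paragraphs that contain no link
        fulllink = anchor['href']
        logfile.write(fulllink + "\n")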
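For a version that can be run end to end, the following is a rough sketch of the same loop. The helper names collect_post_links and write_links, the sample markup, and the example.com URLs are all assumptions for illustration. It writes to an in-memory buffer, but a file opened with io.open('TPNurls.txt', 'a', encoding='utf8') should work the same way.

```python
import io

from bs4 import BeautifulSoup


def collect_post_links(html):
    """Return the href of every <a> found inside a <p class="postbody">."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for paragraph in soup.find_all("p", class_="postbody"):
        # href=True keeps only anchors that actually carry an href attribute.
        for anchor in paragraph.find_all("a", href=True):
            links.append(anchor["href"])
    return links


def write_links(links, logfile):
    """Write one URL per line to any file-like object."""
    for href in links:
        logfile.write(href + "\n")


# Hypothetical page: two post paragraphs, one without a link, one unrelated.
sample_html = """
<p class="postbody"><b>Post one</b> <a href="http://example.com/1">link</a></p>
<p class="postbody"><b>No link here</b></p>
<p class="sidebar"><a href="http://example.com/ignored">ad</a></p>
<p class="postbody"><b>Post two</b> <a href="http://example.com/2">link</a></p>
"""

links = collect_post_links(sample_html)
buffer = io.StringIO()
write_links(links, buffer)
print(buffer.getvalue())
```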

Upvotes: 0

alecxe

Reputation: 474121

The actual reason for the error is that snippet is the result of a find_all() call, which is basically a list of results; no find() method is available on it. Instead, you meant:

snippet = soup.find('p', class_="postbody")
for link in snippet.find_all('a'):
    fulllink = link.get('href')
    logfile.write(fulllink + "\n")

Also, note the use of class_ here: class is a reserved keyword in Python and cannot be used as a keyword argument. See Searching by CSS class for more info.
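As a small illustration of the class_ point, the sketch below compares the keyword form with the attribute-dictionary form and checks that class= is rejected by Python itself. The markup is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = (
    '<p class="postbody"><a href="/a">A</a></p>'
    '<p class="other"><a href="/b">B</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# Three spellings of the same search by CSS class.
by_keyword = soup.find_all("p", class_="postbody")
by_dict = soup.find_all("p", {"class": "postbody"})
by_attrs = soup.find_all("p", attrs={"class": "postbody"})

hrefs_keyword = [p.a["href"] for p in by_keyword]
hrefs_dict = [p.a["href"] for p in by_dict]
hrefs_attrs = [p.a["href"] for p in by_attrs]

# Writing class= is a syntax error, so the call never reaches BeautifulSoup.
try:
    compile('soup.find_all("p", class="postbody")', "<example>", "eval")
    class_keyword_ok = True
except SyntaxError:
    class_keyword_ok = False

print(hrefs_keyword, hrefs_dict, hrefs_attrs, class_keyword_ok)
```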


Alternatively, make use of CSS selectors:

for link in soup.select('p.postbody a'):
    fulllink = link.get('href')
    logfile.write(fulllink + "\n")

The selector p.postbody a matches every a tag nested inside a p tag with class postbody.
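A minimal sketch of the selector approach, run against made-up markup. The a[href] attribute selector is an addition here (not part of the answer above) that skips anchors without an href.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a nested link, an anchor without href, an unrelated link.
html = """
<p class="postbody"><span><a href="http://example.com/nested">nested</a></span></p>
<p class="postbody"><a name="top">no href</a> <a href="http://example.com/plain">plain</a></p>
<div><a href="http://example.com/outside">outside</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# The space is the descendant combinator, so links nested at any depth
# inside a matching <p> are included.
hrefs = [a["href"] for a in soup.select("p.postbody a[href]")]

print(hrefs)
```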

Upvotes: 4
