Using Beautiful Soup 4 to scrape URLs within a <p class="postbody"> tag and save them to a text file

Question

I realize this is probably incredibly straightforward, but please bear with me. I'm trying to use BeautifulSoup 4 to scrape a website that lists blog posts and collect the URLs of those posts. The <a> tag that I want is within a <p class="postbody"> tag. There are multiple <p> tags, each containing a header and then a link that I want to capture. This is the code I'm working with:

with io.open('TPNurls.txt', 'a', encoding='utf8') as logfile:
   snippet = soup.find_all('p', class="postbody")
   for link in snippet.find('a'):
       fulllink = link.get('href')
       logfile.write(fulllink + "\n")

The error I'm getting is:

AttributeError: 'ResultSet' object has no attribute 'find'

I understand that this means "snippet" is a set and BeautifulSoup doesn't let me look for tags within a set. But then how can I do this? I need it to find the entire set of <p> tags, look for the <a> tag within each one, and then save each URL on a separate line in a file.

alecxe · Accepted Answer

The actual reason for the error is that snippet is the result of a find_all() call, which is basically a list of results; there is no find() method available on it. Instead, you meant:

snippet = soup.find('p', class_="postbody")
for link in snippet.find_all('a'):
    fulllink = link.get('href')
    logfile.write(fulllink + "\n")

Also, note the use of class_ here - class is a reserved keyword and cannot be used as a keyword argument here. See Searching by CSS class for more info.
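
Note that find() returns only the first matching <p>. If the page has several <p class="postbody"> tags, as the question describes, one option is to loop over the find_all() results and search inside each one. Here is a minimal, self-contained sketch of that idea. The sample HTML, the example.com URLs, and the collect_links helper are made up for illustration and are not from the question:

```python
import io
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real blog page.
html = """
<p class="postbody"><b>First post</b> <a href="http://example.com/post-1">read</a></p>
<p class="postbody"><b>Second post</b> <a href="http://example.com/post-2">read</a></p>
<p class="other"><a href="http://example.com/ignored">skip</a></p>
"""

def collect_links(soup):
    """Return the href of every <a> inside every <p class="postbody">."""
    links = []
    # find_all() returns a ResultSet, so loop over it instead of calling find() on it.
    for paragraph in soup.find_all('p', class_='postbody'):
        for link in paragraph.find_all('a'):
            href = link.get('href')
            if href:  # skip anchors that have no href attribute
                links.append(href)
    return links

soup = BeautifulSoup(html, 'html.parser')
urls = collect_links(soup)

# Append one URL per line, using the file name from the question.
with io.open('TPNurls.txt', 'a', encoding='utf8') as logfile:
    for url in urls:
        logfile.write(url + "\n")
```

The check for a missing href is an extra safety step: link.get('href') returns None for an <a> without that attribute, and concatenating None with "\n" would raise a TypeError.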

Alternatively, make use of CSS selectors:

for link in soup.select('p.postbody a'):
    fulllink = link.get('href')
    logfile.write(fulllink + "\n")

The selector p.postbody a matches every a tag nested inside a p tag with class postbody.
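
To see the selector approach run end to end, here is a small sketch that calls select() on the whole document. The sample HTML and the example.com URLs are assumptions for illustration only. The a[href] part of the selector is an addition that restricts matches to anchors that actually carry an href attribute:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real blog page.
html = """
<p class="postbody"><b>First post</b> <a href="http://example.com/post-1">read</a></p>
<p class="postbody"><b>Second post</b> <a href="http://example.com/post-2">read</a></p>
<p class="other"><a href="http://example.com/ignored">skip</a></p>
"""

soup = BeautifulSoup(html, 'html.parser')

# One selector covers every <p class="postbody"> in the document,
# so no nested loop over paragraphs is needed.
hrefs = [a['href'] for a in soup.select('p.postbody a[href]')]

for href in hrefs:
    print(href)
```

Because select() returns matches in document order, the URLs come out in the same order as the posts appear on the page.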
