terry

Reputation: 21

Python: BeautifulSoup find_all search within <a> tag or everything else?

Can anyone confirm whether "find_all" automatically searches within tags? I was expecting find_all("a") to pick up everything that contains "a", but it actually returns every "<a ...> ... </a>" element. Also, what is the difference between "find_all" and "find"?

from bs4 import BeautifulSoup
import requests
url = "https://boston.craigslist.org/search/sof"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data,'html.parser')
tags = soup.find_all("a")

Result:

[<a class="appstorebtn" href="https://play.google.com/store/apps/details?id=org.craigslist.CraigslistMobile">
         Android
     </a>,
 <a class="appstorebtn" href="https://apps.apple.com/us/app/craigslist/id1336642410">
         iOS
     </a>,
 <a class="header-logo" href="/" name="logoLink">CL</a>,
 <a href="/">boston</a>,
 <a href="https://post.craigslist.org/c/bos">post</a>,
 <a href="https://accounts.craigslist.org/login/home">account</a>,
 <a class="favlink" href="#"><span aria-hidden="true" class="icon icon-star fav"></span><span class="fav-number">0</span><span class="fav-label"> favorites</span></a>,
 <a class="to-banish-page-link" href="#">
 <span aria-hidden="true" class="icon icon-trash red"></span>
 <span class="banished_count">0</span>
 <span class="discards-label"> hidden</span>
 </a>,
 <a class="header-logo" href="/">CL</a>,

Upvotes: 1

Views: 940

Answers (1)

HedgeHog

Reputation: 25087

find_all()

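The find_all() method looks through a tag's descendants and returns a list of all descendants that match your filters. A plain string argument such as "a" is matched against tag names, so find_all("a") returns every <a> element, not every element whose text or attributes contain the letter "a".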
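As a small illustration of the point above, here is a sketch on a made-up HTML snippet (the markup and variable names are assumptions for the example, not taken from the Craigslist page). It shows that find_all("a") returns the <a> elements at any nesting depth and ignores a <p> whose text merely contains the letter "a".

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <p> with "a" in its text, two <a> tags at different depths
html = """
<div>
  <p>banana</p>
  <a href="/one">first</a>
  <ul><li><a href="/two">second</a></li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# The string "a" is treated as a tag name, so only <a> elements match
links = soup.find_all("a")
link_texts = [a.get_text() for a in links]
tag_names = {tag.name for tag in links}

print(link_texts)
```

The nested link inside the <ul> is found as well, because the search covers all descendants rather than only direct children.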

find() vs find_all()

  • Use find() if you just want the first occurrence that matches your filters.
  • Use find_all() if you want all occurrences that match your filters.
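The difference can be seen in a minimal sketch like the one below, again on an invented snippet (the HTML and names are assumptions for illustration). It also shows what each method gives back when nothing matches, which matters when you chain calls.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two links and no table
html = '<p><a href="/one">first</a> <a href="/two">second</a></p>'
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")       # a single Tag: the first match
all_links = soup.find_all("a")    # a list-like result holding every match

missing_one = soup.find("table")      # None when nothing matches
missing_all = soup.find_all("table")  # empty result when nothing matches

print(first_link["href"], len(all_links), missing_one, len(missing_all))
```

Because find() can return None, check its result before indexing into it; an empty find_all() result can simply be iterated without a check.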

Example - Get all href

from bs4 import BeautifulSoup
import requests
url = "https://boston.craigslist.org/search/sof"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')
# href=True keeps only the <a> tags that actually have an href attribute
hrefs = [a['href'] for a in soup.find_all('a', href=True)]
print(hrefs)

Output (you may have to iterate over it and clean it, or customize the filters above, to get only the href values that contain http, ...)

['https://play.google.com/store/apps/details?id=org.craigslist.CraigslistMobile',
 'https://apps.apple.com/us/app/craigslist/id1336642410',
 '/',
 '/',
 'https://post.craigslist.org/c/bos',
 'https://accounts.craigslist.org/login/home',
 '#',
 '#',
 '/',
 'https://accounts.craigslist.org/savesearch/save?URL=https%3A%2F%2Fboston%2Ecraigslist%2Eorg%2Fd%2Fsoftware%2Dqa%2Ddba%2Detc%2Fsearch%2Fsof',
 '/d/software-qa-dba-etc/search/sof',
 '/d/software-qa-dba-etc/search/sof',
 '/d/software-qa-dba-etc/search/sof?sort=date&',
...]
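One possible way to do that cleanup is sketched below. It works on a short sample of the values shown above rather than on a live request, and the variable names are made up for the example. The first list keeps only links that are already absolute; the second resolves relative paths against the page URL with the standard library's urljoin and drops the "#" placeholders.

```python
from urllib.parse import urljoin

# Page the links were scraped from, and a small sample of the hrefs above
base_url = "https://boston.craigslist.org/search/sof"
hrefs = [
    "https://post.craigslist.org/c/bos",
    "/",
    "#",
    "/d/software-qa-dba-etc/search/sof",
]

# Keep only links that already start with http:// or https://
absolute_only = [h for h in hrefs if h.startswith(("http://", "https://"))]

# Alternatively, turn relative paths into full URLs and skip "#" placeholders
resolved = [urljoin(base_url, h) for h in hrefs if h != "#"]

print(absolute_only)
print(resolved)
```

Which variant fits depends on whether you care about the site's internal links or only about links pointing elsewhere.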

Upvotes: 1
