hiimarksman

Reputation: 301

How to grab all headers from a website using BeautifulSoup?

I'm trying to grab all the headers from a simple website. My attempt:

from bs4 import BeautifulSoup, SoupStrainer
import requests

url = "http://nypost.com/business"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, "html.parser")
soup.find_all('h')

soup.find_all('h') returns [], but soup.h1 or soup.h2 returns the corresponding element. Am I calling the method incorrectly?

Upvotes: 17

Views: 23919

Answers (4)

sameh sharawy

Reputation: 31

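The find and find_all methods accept either a single tag name as a string or a list of tag names, so you can pass all six heading tags at once:

soup.find_all([f'h{i}' for i in range(1, 7)])

or, on Python versions without f-strings:

soup.find_all(['h{}'.format(i) for i in range(1, 7)])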
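As a minimal sketch of the list approach, the snippet below runs it against a small inline document instead of a live page. The sample markup and the variable names (html, heading_tags, found) are made up for illustration, and "html.parser" is just the parser bundled with Python; any installed parser should behave the same way here.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page.
html = """
<html><body>
<h1>Business</h1>
<p>Intro text</p>
<h2>Markets</h2>
<h3>Retail</h3>
<header>Not a heading tag</header>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Build the list ['h1', 'h2', ..., 'h6'] and pass it to find_all.
heading_tags = [f"h{i}" for i in range(1, 7)]
headings = soup.find_all(heading_tags)

# Collect (tag name, text) pairs in document order.
found = [(tag.name, tag.get_text(strip=True)) for tag in headings]
for name, text in found:
    print(name, text)
```

A list filter matches tag names exactly, so the header element in the sample is not picked up even though its name starts with h.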

Upvotes: 3

phd

Reputation: 94676

Filter by regular expression:

import re

soup.find_all(re.compile('^h[1-6]$'))

This regex matches tag names that start with h, continue with a single digit from 1 to 6, and end right after that digit, which covers h1 through h6.
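To see which tag names the pattern accepts, it can be tried on plain strings with the standard re module alone. This is only a sketch: the candidate names are arbitrary examples, and search() is used because, as far as I know, that is how Beautiful Soup applies a compiled pattern to tag names.

```python
import re

# Same pattern as in the answer: 'h', one digit from 1 to 6, then end of name.
heading_pattern = re.compile(r"^h[1-6]$")

# Arbitrary tag names to test against the pattern.
candidate_names = ["h", "h1", "h6", "h7", "hr", "header", "html"]

# Keep only the names the pattern matches.
matched = [name for name in candidate_names if heading_pattern.search(name)]
print(matched)
```

The bare name h from the question does not match, which is consistent with soup.find_all('h') returning an empty list: there is no tag literally named h.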

Upvotes: 24

SIM

Reputation: 22440

If you do not want to use a regex, you can search for one heading level at a time, for example:

from bs4 import BeautifulSoup
import requests

url = "http://nypost.com/business"

page = BeautifulSoup(requests.get(url).text, "lxml")
for headlines in page.find_all("h3"):
    print(headlines.text.strip())

Results:

The epitome of chic fashion is the latest victim of retail's collapse
Rent-a-Center shares soar after rejecting takeover bid
NFL ad revenue may go limp with loss of erectile-dysfunction ads
'Pharma Bro' talked about sex with men to get my money, investor says

And so on.

Upvotes: 3

PYA

Reputation: 8636

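You need to pass a full tag name, for example soup.find_all('h1').

To cover several heading levels, you could do something like:

headers = []
for tag_name in ["h1", "h2"]:
    # find_all returns a list, so accumulate the matches for each tag name
    headers.extend(soup.find_all(tag_name))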

Upvotes: 2
