Jim jason

Reputation: 19

How to remove HTML tags from the result of BeautifulSoup find_all

I need to remove the tags and keep only the text in the output of the code below, using Python and BeautifulSoup.


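import requests
from bs4 import BeautifulSoup as bs

r = requests.get("https://www.w3schools.com/html/html_intro.asp")
soup = bs(r.content, "html.parser")  # name the parser explicitly to avoid a bs4 warning
print(soup.prettify())

# first matching header only
first_header = soup.find(["h2", "h2"])

# all matching headers; this ResultSet still contains the tags
first_headers = soup.find_all(["h2", "h2"])
first_headers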

Upvotes: 0

Views: 428

Answers (2)

HedgeHog

Reputation: 25048

To get only the text from your ResultSet, iterate over it (e.g. with a list comprehension), call .text on every element, and .join() the resulting strings with a single space:

' '.join([e.text for e in soup.find_all('h2')])  

Example

import requests
from bs4 import BeautifulSoup as bs
r = requests.get("https://www.w3schools.com/html/html_intro.asp")
soup = bs(r.content, "html.parser")


first_headers = ' '.join([e.text for e in soup.find_all('h2')])

print(first_headers)

Output

Tutorials References Exercises and Quizzes HTML Tutorial HTML Forms HTML Graphics HTML Media HTML APIs HTML Examples HTML References What is HTML? A Simple HTML Document What is an HTML Element? Web Browsers HTML Page Structure HTML History Report Error Thank You For Helping Us!
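The same idea can be tried without a network request. The following is only a minimal sketch: the inline HTML string is a made-up stand-in for the downloaded page, and get_text(strip=True) is used in place of .text so that whitespace around each heading is trimmed before joining.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for the downloaded page
html = """
<h1>HTML Introduction</h1>
<h2>What is HTML?</h2>
<h2>  Web Browsers  </h2>
"""

soup = BeautifulSoup(html, "html.parser")

# get_text(strip=True) returns the tag's text with surrounding whitespace removed
headers = [tag.get_text(strip=True) for tag in soup.find_all("h2")]

# Join the individual heading texts into one space-separated string
joined = " ".join(headers)
print(joined)
```

If you need the headings individually later on, keep the list (headers) and only join it when printing.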

Upvotes: 0

Sukka Rishivarun Goud

Reputation: 24

import requests
from bs4 import BeautifulSoup as bs
r = requests.get("https://www.w3schools.com/html/html_intro.asp")
soup = bs(r.content, features="html.parser")  # get the content of the web page
# retrieve all h1 and h2 tags and extract the text from each of them
first_headers = [html.text for html in soup.find_all(["h1", "h2"])]
print(first_headers)

I used a list comprehension to solve it in a single line. You can use a for loop instead, which goes as follows:

import requests
from bs4 import BeautifulSoup as bs
r = requests.get("https://www.w3schools.com/html/html_intro.asp")
soup = bs(r.content,features="html.parser")

first_headers = soup.find_all(["h1", "h2"])
for i in first_headers:
    print(i.text)

This is the output of my code:

Tutorials
References
Exercises and Quizzes
HTML Tutorial
HTML Forms
HTML Graphics
HTML Media
HTML APIs
HTML Examples
HTML References
HTML Introduction
What is HTML?
A Simple HTML Document
What is an HTML Element?
Web Browsers
HTML Page Structure
HTML History
Report Error
Thank You For Helping Us!
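As the output shows, passing a list of tag names to find_all returns the matches in document order, with h1 and h2 mixed together. If you also want to know which kind of heading each line came from, a small sketch like the one below may help. The inline HTML is a made-up stand-in for the real page, and pairing the text with tag.name is my own addition rather than part of the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for the downloaded page
html = """
<h2>Tutorials</h2>
<h1>HTML Introduction</h1>
<h2>What is HTML?</h2>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all with a list matches any of the given tag names, in document order
results = []
for tag in soup.find_all(["h1", "h2"]):
    # tag.name is the tag type ("h1" or "h2"); tag.text is its text without markup
    results.append((tag.name, tag.text))

for name, text in results:
    print(f"{name}: {text}")
```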

Upvotes: 1
