Reputation: 694
I have a simple HTML file I want to convert. Depending on the class of the tag I need to modify the content:
<HTML>
<HEAD>
<TITLE>Eine einfache HTML-Datei</TITLE>
<meta name="description" content="A simple HTML page for BS4">
<meta name="author" content="Uwe Ziegenhagen">
<meta charset="UTF-8">
</HEAD>
<BODY>
<H1>Hallo Welt</H1>
<p>Ein kurzer Absatz mit ein wenig Text, der relativ nichtssagend ist.</p>
<H1>Nochmal Hallo Welt!</H1>
<p>Schon wieder ein kurzer Absatz mit ein wenig Text, der genauso nichtssagend ist wie der Absatz zuvor.</p>
</BODY>
</HTML>
How can I walk the BS4 tree and apply different modifications depending on whether I have an "H1", a "p", or another class of tag? I imagine I need some kind of switch statement to decide at each element how to deal with it.
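from bs4 import BeautifulSoup

with open("simple.html", "r") as htmlsource:
    html = htmlsource.read()

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, "html.parser")

# Iterate over the direct children of <body> (tags and text nodes alike)
for item in soup.body:
    print(item)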
Upvotes: 1
Views: 1447
Reputation: 3735
Try this code:
from bs4 import BeautifulSoup

with open("simple.html", "r") as htmlsource:
    html = htmlsource.read()

soup = BeautifulSoup(html, "html.parser")
for item in soup.body:
    print(item)

# Select every tag in the HTML page
elems = soup.find_all()
for item in elems:
    # Check whether the tag carries a specified CSS class.
    # item.get() returns the default for tags without a class attribute,
    # whereas item['class'] would raise KeyError and skip the name check below.
    if 'myClass' in item.get('class', []):
        print(item)
    # Check whether the tag name equals a specified tag name
    elif item.name == 'p':
        print(item)
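If the goal is to change the content rather than just print it, the same tag-name check can drive the modification. The following is only a minimal sketch: the inline sample HTML, the function name convert, and the two transformations (upper-casing headings, bracketing paragraphs) are made up for illustration. Note that html.parser lower-cases tag names, so <H1> is matched as "h1".

```python
from bs4 import BeautifulSoup

# Illustrative sample; in the question this would come from simple.html
HTML = """<html><body>
<H1>Hallo Welt</H1>
<p>Ein kurzer Absatz.</p>
</body></html>"""


def convert(html):
    """Parse html and rewrite the text of h1 and p tags in place."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all accepts a list of tag names and returns matches in document order
    for tag in soup.find_all(["h1", "p"]):
        text = tag.get_text()
        if tag.name == "h1":
            # Hypothetical rule: headings become upper case
            tag.string = text.upper()
        elif tag.name == "p":
            # Hypothetical rule: paragraphs get wrapped in brackets
            tag.string = "[" + text + "]"
    return soup


soup = convert(HTML)
print(soup.body)
```

Assigning to tag.string replaces whatever the tag contained, so this only suits tags that hold plain text.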
Upvotes: 0
Reputation: 84351
BeautifulSoup tag objects have a name property which you can check. For example, here's a function which transforms the tree by appending the string "Done with this " + the appropriate tag name to each node in a postwalk:
def walk(soup):
    if hasattr(soup, "name"):
        for child in soup.children:
            walk(child)
        soup.append("Done with this " + soup.name)
NB. the NavigableString objects representing textual content and the Comment objects representing comments don't have attributes such as name or children, so if you walk the entire tree like above, you need to check if you actually have a tag in hand (which I'm doing with the hasattr call above; I suppose you could check that the type is bs4.element.Tag).
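For the "switch statement" part of the question, one possible shape is a dictionary that maps tag names to handler functions, combined with the bs4.element.Tag type check mentioned above. This is a rough sketch, not a definitive implementation: HANDLERS, the two lambdas, and the inline sample HTML are hypothetical, and the walk here is pre-order and stops descending once a handler matches, unlike the postwalk shown earlier.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical dispatch table: one handler per tag name (lower-case,
# because html.parser normalises tag names to lower case).
HANDLERS = {
    "h1": lambda tag: "# " + tag.get_text(),
    "p": lambda tag: tag.get_text(),
}


def walk(node, out):
    """Visit tags under node in document order and collect handler results."""
    # Text nodes and comments are not Tag instances, so they are skipped here
    if not isinstance(node, Tag):
        return
    handler = HANDLERS.get(node.name)
    if handler is not None:
        out.append(handler(node))
        return  # handled, so do not descend into this tag
    for child in node.children:
        walk(child, out)


soup = BeautifulSoup(
    "<body><h1>Hallo Welt</h1><!-- note --><p>Text</p></body>",
    "html.parser",
)
result = []
walk(soup.body, result)
print(result)
```

Tags without an entry in the table (such as body) are simply traversed, so supporting another tag only means adding one more key.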
Upvotes: 1