Uwe Ziegenhagen

Reputation: 694

How to get the names from elements Beautiful Soup 4 has parsed

I have a simple HTML file that I want to convert. Depending on the kind of tag (its name), I need to modify the content:

<HTML>
<HEAD>
<TITLE>Eine einfache HTML-Datei</TITLE>
<meta name="description" content="A simple HTML page for BS4">
<meta name="author" content="Uwe Ziegenhagen">
<meta charset="UTF-8">
</HEAD>
<BODY>

<H1>Hallo Welt</H1>

<p>Ein kurzer Absatz mit ein wenig Text, der relativ nichtssagend ist.</p>

<H1>Nochmal Hallo Welt!</H1>

<p>Schon wieder ein kurzer Absatz mit ein wenig Text, der genauso nichtssagend ist wie der Absatz zuvor.</p>

</BODY>
</HTML>

How can I walk the BS4 tree and apply different modifications depending on whether the current element is an "H1", a "p", or some other kind of tag? I imagine I need something like a switch statement that decides, for each element, how to handle it.

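from bs4 import BeautifulSoup

with open("simple.html", "r", encoding="utf-8") as htmlsource:
    html = htmlsource.read()

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")

# Iterating over a tag yields its direct children, including whitespace strings
for item in soup.body:
    print(item)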

Upvotes: 1

Views: 1447

Answers (2)

amatellanes

Reputation: 3735

Try this code:

from bs4 import BeautifulSoup

with open("simple.html", "r", encoding="utf-8") as htmlsource:
    html = htmlsource.read()

soup = BeautifulSoup(html, "html.parser")

for item in soup.body:
    print(item)

# find_all() with no arguments returns every tag in the HTML page
elems = soup.find_all()
for item in elems:
    # Check whether the element carries a specified class; get() returns
    # the default instead of raising KeyError when there is no class attribute
    if "myClass" in item.get("class", []):
        print(item)
    # Otherwise check whether the tag name is equal to a specified tag name
    elif item.name == "p":
        print(item)
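The "switch statement" the question asks about could be sketched as a dictionary that maps tag names to handler functions. This is only an illustration under assumptions: the handler names `handle_h1` and `handle_p`, the `HANDLERS` table, the inline HTML, and the modifications they perform are all hypothetical and not part of the original question or answer.

```python
from bs4 import BeautifulSoup

HTML = """<html><body>
<h1>Hallo Welt</h1>
<p>Ein kurzer Absatz.</p>
</body></html>"""

def handle_h1(tag):
    # Hypothetical modification: upper-case the heading text.
    tag.string = tag.get_text().upper()

def handle_p(tag):
    # Hypothetical modification: mark the paragraph with a class.
    tag["class"] = ["converted"]

# Maps a tag name to the function that modifies tags of that kind.
HANDLERS = {"h1": handle_h1, "p": handle_p}

soup = BeautifulSoup(HTML, "html.parser")

# find_all(True) returns every Tag and skips text nodes and comments.
for tag in soup.find_all(True):
    handler = HANDLERS.get(tag.name)
    if handler is not None:
        handler(tag)

print(soup.h1.get_text())
print(soup.p["class"])
```

Tags without an entry in the table are left untouched, so supporting another kind of tag only requires adding one more entry.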

Upvotes: 0

Michał Marczyk

Reputation: 84351

BeautifulSoup tag objects have a name attribute that you can check. For example, here is a function that transforms the tree in a post-order walk, appending the string "Done with this " plus the tag's name to each node after all of its children have been visited:

def walk(soup):
    if hasattr(soup, "name"):
        for child in soup.children:
            walk(child)
        soup.append("Done with this " + soup.name)

N.B. The NavigableString objects that represent text content and the Comment objects that represent comments don't have attributes such as name or children, so if you walk the entire tree as above, you need to check that you actually have a tag in hand. I do that with the hasattr call above; alternatively, you could check that the object is an instance of bs4.element.Tag.
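The type-check alternative mentioned above might look like the following sketch. It keeps the same post-order walk but tests `isinstance(node, Tag)` instead of `hasattr`, which does not depend on which attributes string nodes happen to expose. The inline HTML and the parameter name `node` are assumptions made for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

def walk(node):
    """Post-order walk that appends a marker string to every Tag."""
    # Strings and comments are not Tag instances, so they are skipped.
    if isinstance(node, Tag):
        # Snapshot the children so later tree edits cannot affect this loop.
        for child in list(node.children):
            walk(child)
        node.append("Done with this " + node.name)

soup = BeautifulSoup(
    "<body><h1>Hallo</h1><!-- note --><p>Text</p></body>", "html.parser"
)
walk(soup.body)
print(soup.h1.get_text())
```

Because the BeautifulSoup object is itself a Tag subclass, calling `walk(soup)` instead of `walk(soup.body)` would also append a marker for the document root.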

Upvotes: 1
