Abhilash Rao

Reputation: 41

Is there a way to extract all the class names from an HTML file using BeautifulSoup?

<tr id="section_1asd8aa" class="main">
<td class="header">
  <table cellspacing="0" cellpadding="0">
      <tbody>
        <tr>
            <td style="font-family: arial,sans-serif; font-size: 11px;">DUMMY TEXT<a href="#">browser.</a>
            </td>
          </tr>
      </tbody>
    </table>
</td></tr>

Above is a sample HTML snippet. I want to extract all the class names from the HTML file. Expected output: '{ "c1":"main","c2":"header"}'

Upvotes: 1

Views: 1246

Answers (1)

akuiper

Reputation: 214927

You can use find_all to get every node in the document, then loop over the nodes and collect the class attribute of each node that has one:

from bs4 import BeautifulSoup
soup = BeautifulSoup("""<tr id="section_1asd8aa" class="main">
<td class="header">
  <table cellspacing="0" cellpadding="0">
      <tbody>
        <tr>
            <td style="font-family: arial,sans-serif; font-size: 11px;">DUMMY TEXT<a href="#">browser.</a>
            </td>
          </tr>
      </tbody>
    </table>
</td></tr>""", "html.parser")

To get a list of class names:

lst = [node['class'] for node in soup.find_all() if node.has_attr('class')]
lst
# [['main'], ['header']]
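The same filtering can also be pushed into find_all itself. The following is a minimal sketch, not part of the original answer: passing class_=True asks BeautifulSoup for every tag that has a class attribute at all. The markup and the names html, tagged and per_tag are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one tag with two classes, one with one, one with none.
html = '<div class="card wide"><p class="note">hi</p><span>no class</span></div>'
soup = BeautifulSoup(html, "html.parser")

# class_=True matches any tag that carries a class attribute,
# so no separate has_attr check is needed.
tagged = soup.find_all(class_=True)

# Each tag's class value is a list of the names it carries.
per_tag = [tag["class"] for tag in tagged]
print(per_tag)
# [['card', 'wide'], ['note']]
```

The span is skipped because it has no class attribute, and the div contributes two names in one list.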

Convert the list to a dictionary:

{"c" + str(i): v for i, v in enumerate(lst)}
# {'c0': ['main'], 'c1': ['header']}

Note that each class value is wrapped in a list, because a class attribute can hold multiple values. You can join each list into a single string if desired:

{"c" + str(i): " ".join(v) for i, v in enumerate(lst)}
# {'c0': 'main', 'c1': 'header'}
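The question's expected output numbers the keys from c1 and lists each class name once. A possible sketch of that variant follows, assuming duplicates should be dropped and first-seen order kept; the helper name class_names_by_key and the sample markup are hypothetical, not from the original post.

```python
from bs4 import BeautifulSoup

def class_names_by_key(html):
    """Map 'c1', 'c2', ... to each distinct class name, in first-seen order."""
    soup = BeautifulSoup(html, "html.parser")
    seen = []
    for tag in soup.find_all():
        # tag.get returns the list of class names, or [] when there is none
        for name in tag.get("class", []):
            if name not in seen:
                seen.append(name)
    # start=1 makes the keys begin at c1 rather than c0
    return {f"c{i}": name for i, name in enumerate(seen, start=1)}

# Hypothetical sample: "main" appears twice but is reported only once.
sample = '<tr class="main"><td class="header main"><a href="#">x</a></td></tr>'
print(class_names_by_key(sample))
# {'c1': 'main', 'c2': 'header'}
```

Flattening the per-tag lists this way gives one entry per class name rather than one entry per tag, which may or may not be what a given use case needs.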

Upvotes: 4
