Reputation: 41
<tr id="section_1asd8aa" class="main">
<td class="header">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td style="font-family: arial,sans-serif; font-size: 11px;">DUMMY TEXT<a href="#">browser.</a>
</td>
</tr>
</tbody>
</table>
</td></tr>
Above is a sample HTML snippet. I want to extract all the class names from the HTML file. Expected output: '{ "c1":"main","c2":"header"}'
Upvotes: 1
Views: 1246
Reputation: 214927
You can use find_all() to get every node in the document, then loop through the nodes and check whether each one has a class attribute; if it does, collect its class:
from bs4 import BeautifulSoup
soup = BeautifulSoup("""<tr id="section_1asd8aa" class="main">
<td class="header">
<table cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td style="font-family: arial,sans-serif; font-size: 11px;">DUMMY TEXT<a href="#">browser.</a>
</td>
</tr>
</tbody>
</table>
</td></tr>""", "html.parser")
To get a list of class names:
lst = [node['class'] for node in soup.find_all() if node.has_attr('class')]
lst
# [['main'], ['header']]
Convert the list to a dictionary:
{"c"+str(i): v for i, v in enumerate(lst)}
# {'c0': ['main'], 'c1': ['header']}
Notice that each class value is wrapped in a list, because a class attribute can hold multiple values. You can join each list into a single string if that's what you want:
{"c"+str(i): " ".join(v) for i, v in enumerate(lst)}
# {'c0': 'main', 'c1': 'header'}
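The keys above start at c0, while the question asks for c1, c2, ... and a JSON-style string. A minimal sketch that combines the steps into one function, assuming that is the output you want; class_names_json is a hypothetical helper name, and find_all(class_=True) is used here as a shorthand for the has_attr('class') check:

```python
import json
from bs4 import BeautifulSoup

def class_names_json(html: str) -> str:
    """Hypothetical helper: map c1, c2, ... to each tag's class value."""
    soup = BeautifulSoup(html, "html.parser")
    # class_=True keeps only the tags that carry a class attribute.
    nodes = soup.find_all(class_=True)
    # start=1 makes the keys begin at c1 instead of c0;
    # " ".join(...) collapses multi-valued classes into one string.
    mapping = {
        f"c{i}": " ".join(node["class"])
        for i, node in enumerate(nodes, start=1)
    }
    return json.dumps(mapping)

# Shortened version of the sample from the question.
sample = '<tr id="section_1asd8aa" class="main"><td class="header"><a href="#">browser.</a></td></tr>'
result = class_names_json(sample)
print(result)
```

If you need a dict rather than a string, return mapping directly and skip json.dumps.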
Upvotes: 4