Is there a way to extract all the class names from an HTML file using BeautifulSoup?

Question



  
      
        
            browser.

Above is a sample HTML document, and I want to extract all the class names from it. Expected output: '{ "c1":"main","c2":"header"}'

akuiper · Accepted Answer

You can use find_all() to get every tag in the document, then loop over the tags and keep the class attribute of each tag that has one:

from bs4 import BeautifulSoup
soup = BeautifulSoup("""

  
      
        
            browser.
            
          
      
    
""", "html.parser")

To get a list of class names:

lst = [node['class'] for node in soup.find_all() if node.has_attr('class')]
lst
# [['main'], ['header']]
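
For a version that runs on its own, here is a minimal sketch of the same approach using a hypothetical HTML snippet. The tag names are assumptions made for illustration; only the class names main and header and the text "browser." come from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the tag names are assumed, not taken from the question
html = """
<html>
  <body>
    <div class="main">
      <div class="header">browser.</div>
    </div>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() with no arguments returns every tag in document order;
# has_attr('class') skips tags without a class attribute
class_lists = [node["class"] for node in soup.find_all() if node.has_attr("class")]
print(class_lists)
# [['main'], ['header']]
```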

Convert the list to a dictionary:

{"c" + str(i): v for i, v in enumerate(lst)}
# {'c0': ['main'], 'c1': ['header']}
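
The question's expected output numbers the keys from c1, while enumerate counts from 0 by default. A small sketch of passing start=1 to shift the numbering, with lst hard-coded to the value shown above so that the snippet runs by itself:

```python
# lst is hard-coded here to the result shown above
lst = [["main"], ["header"]]

# enumerate(..., start=1) makes the keys begin at c1 instead of c0
numbered = {"c" + str(i): v for i, v in enumerate(lst, start=1)}
print(numbered)
# {'c1': ['main'], 'c2': ['header']}
```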

Note that each class value is wrapped in a list, because class is a multi-valued attribute: a single element can have several classes. You can join each list into a single string if that is what you want.

{"c" + str(i): " ".join(v) for i, v in enumerate(lst)}
# {'c0': 'main', 'c1': 'header'}
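
If elements carry several classes, or the same class appears on more than one element, it may be more useful to flatten the lists and drop duplicates. This is a sketch under that assumption, again with hypothetical markup rather than the HTML from the question:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; not the HTML from the question
html = (
    '<div class="main wide">'
    '<p class="header">a</p>'
    '<p class="header note">b</p>'
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

# A dict preserves insertion order, so it works as an ordered set here
seen = {}
for node in soup.find_all():
    if node.has_attr("class"):
        for name in node["class"]:
            seen.setdefault(name, None)

unique_classes = list(seen)
print(unique_classes)
# ['main', 'wide', 'header', 'note']
```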
