Regex to capture html elements with their class name

Question

I am trying to get the element and class name for all elements within an HTML file using Python. I managed to get all class names with the code below. It's written like that because I will go through a lot of HTML files while storing elements with their class names, ignoring elements without a class name.

    temp_file = open(root + "/" + file, "r", encoding="utf-8-sig", errors="ignore")
    temp_content = temp_file.read()
    class_names = re.findall("class=\"(.*?)\"", temp_content)

However, I am now struggling to find a way to get the element that each class belongs to. Keep in mind that elements sometimes overlap with each other, so readlines() won't help much either, and it would probably be slower than regexing the entire document at once.


        
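<div class="header_container container_12">
        <div class="grid_5">
              Logo Text Here
        </div>
        <div class="grid_7">
             <div class="menu_items">
                <a class="home active">Home</a><a class="portfolio">Portfolio</a>
               <a class="about">About Me</a>
                <a class="contact">Contact Me</a>
            </div>
        </div>
</div>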

The HTML snippet above is badly indented on purpose, to show the kind of data I am working with. The goal would be to store the class names and their elements in a hashmap, e.g.:

"header_container container_12": "div"
"grid_5": "div"
"grid_7": "div"
"menu_items": "div"
"home active": "a"
"portfolio": "a"
"about": "a"
"contact": "a"

ggorlen · Accepted Answer

Regex is a poor choice for HTML parsing, but luckily this is trivial with BeautifulSoup:

from bs4 import BeautifulSoup

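html = """
<div class="header_container container_12">
        <div class="grid_5">
              Logo Text Here
        </div>
        <div class="grid_7">
             <div class="menu_items">
                <a class="home active">Home</a><a class="portfolio">Portfolio</a>
               <a class="about">About Me</a>
                <a class="contact">Contact Me</a>
            </div>
        </div>
</div>
"""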
    
for elem in BeautifulSoup(html, "lxml").find_all(attrs={"class": True}):
    print(elem.attrs["class"], elem.name)

Output:

['header_container', 'container_12'] div
['grid_5'] div
['grid_7'] div
['menu_items'] div
['home', 'active'] a
['portfolio'] a
['about'] a
['contact'] a
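
To make the "poor choice" point concrete, here is a small sketch of where the question's pattern goes wrong. The sample markup and the `ClassCollector` helper are made up for illustration (they are not from the question), and the standard library's `html.parser` stands in for BeautifulSoup only so the snippet runs without extra packages:

```python
import re
from html.parser import HTMLParser

# The pattern from the question: matches class="..." anywhere in the text.
PATTERN = r'class="(.*?)"'

# Hypothetical sample markup, chosen to trip up the regex.
SAMPLE = """
<div class='single_quoted'>uses single quotes</div>
<!-- <div class="commented_out"></div> -->
<p>Write class="in_text" to set a class.</p>
<a class="real">link</a>
"""


class ClassCollector(HTMLParser):
    """Record (tag, class string) for every start tag that has a class."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.found.append((tag, value))


regex_hits = re.findall(PATTERN, SAMPLE)

collector = ClassCollector()
collector.feed(SAMPLE)
collector.close()

print(regex_hits)       # ['commented_out', 'in_text', 'real']
print(collector.found)  # [('div', 'single_quoted'), ('a', 'real')]
```

The regex misses the single-quoted attribute and reports two classes that are not on any element (one inside a comment, one in plain text), while the parser reports only the two real elements along with their tag names.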

You can put this into a dict as you desire, but be careful: more than one element will likely map to the same key, and each later element overwrites the earlier one. All the dict tells you is that at least one element with a certain tag name exists for a specific tuple of class names in a specific order.

elems = {}

for elem in BeautifulSoup(html, "lxml").find_all(attrs={"class": True}):
    elems[tuple(elem.attrs["class"])] = elem.name

for k, v in elems.items():
    print(k, v)
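
If losing the overwritten entries matters, one possible variation is to collect a set of tag names per key rather than a single name. This is only a sketch: the markup below is hypothetical, the name `tags_by_class` is mine, and it uses the built-in `"html.parser"` backend instead of `"lxml"` so that only `bs4` needs to be installed:

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Hypothetical markup in which two different tags share the class "item".
html = """
<div class="menu">
  <a class="item active">Home</a>
  <span class="item">Label</span>
  <a class="item">About</a>
</div>
"""

# Maps a tuple of class names to every tag name seen with exactly that tuple.
tags_by_class = defaultdict(set)

soup = BeautifulSoup(html, "html.parser")
for elem in soup.find_all(attrs={"class": True}):
    tags_by_class[tuple(elem["class"])].add(elem.name)

for classes, tag_names in tags_by_class.items():
    print(classes, sorted(tag_names))
```

Here `('item',)` ends up with both `a` and `span` instead of whichever came last. Keying by each individual class name, rather than by the whole tuple, would be another reasonable choice if the order and combination of classes do not matter to you.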
