Reputation: 65
I am trying to get the element (tag) name and class name for every element in an HTML file using Python. I managed to get all the class names with the code below. It's written this way because I will be going through a lot of HTML files, storing each element together with its class name and ignoring elements that have no class.
import re
# root and file come from the surrounding directory walk
with open(root + "/" + file, "r", encoding="utf-8-sig", errors="ignore") as temp_file:
    temp_content = temp_file.read()
class_names = re.findall(r'class="(.*?)"', temp_content)
However, I am now struggling to find a way to get the element that each class belongs to. Keep in mind that elements sometimes span several lines or share a line, so readlines() won't help much either, and it would probably be slower than running a regex over the entire document at once.
<div class="header_container container_12">
<div class="grid_5">
<h1><a href="#">Logo Text Here</a></h1>
</div>
<div class="grid_7">
<div class="menu_items">
<a href="#" class="home active">Home</a><a href="#" class="portfolio">Portfolio</a>
<a href="#"
class="about">About Me
</a><a href="#" class="contact">Contact Me</a>
</div>
</div>
</div>
The HTML snippet above is badly indented on purpose, to showcase the kind of data I am working with. The goal would be to store the results in a hashmap, e.g.
"header_container container_12": "div"
"grid_5": "div"
"grid_7": "div"
"menu_items": "div"
"home active": "a"
"portfolio": "a"
"about": "a"
"contact": "a"
Upvotes: 2
Views: 887
Reputation: 57115
Regex is a poor choice for HTML parsing, but luckily this is trivial with BeautifulSoup:
from bs4 import BeautifulSoup
html = """<div class="header_container container_12">
<div class="grid_5">
<h1><a href="#">Logo Text Here</a></h1>
</div>
<div class="grid_7">
<div class="menu_items">
<a href="#" class="home active">Home</a><a href="#" class="portfolio">Portfolio</a>
<a href="#"
class="about">About Me
</a><a href="#" class="contact">Contact Me</a>
</div>
</div>
</div>"""
for elem in BeautifulSoup(html, "lxml").find_all(attrs={"class": True}):
print(elem.attrs["class"], elem.name)
Output:
['header_container', 'container_12'] div
['grid_5'] div
['grid_7'] div
['menu_items'] div
['home', 'active'] a
['portfolio'] a
['about'] a
['contact'] a
You can put this into a dict if you like, but be careful: dict keys are unique, so when more than one element has the same class list, later elements overwrite earlier ones. All the dict tells you is that some element exists with a certain tag name for a given class tuple (in that specific order).
elems = {}
for elem in BeautifulSoup(html, "lxml").find_all(attrs={"class": True}):
elems[tuple(elem.attrs["class"])] = elem.name
for k, v in elems.items():
print(k, v)
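If overwriting is a concern, one option is to collect every tag name seen for a class tuple instead of keeping only the last one. This is a minimal sketch, not part of the code above: `group_tags_by_class` is a made-up helper, and the `pairs` list stands in for the `(elem.attrs["class"], elem.name)` values produced by the loop. The last pair is a hypothetical extra element added only to show a collision.

```python
from collections import defaultdict


def group_tags_by_class(pairs):
    """Map each class tuple to the set of tag names seen with it."""
    grouped = defaultdict(set)
    for classes, tag in pairs:
        # Lists are not hashable, so use a tuple as the dict key.
        grouped[tuple(classes)].add(tag)
    return dict(grouped)


# Illustrative input: the first three rows mirror the output shown earlier;
# the last row is a hypothetical element that reuses the "grid_5" class.
pairs = [
    (["header_container", "container_12"], "div"),
    (["grid_5"], "div"),
    (["home", "active"], "a"),
    (["grid_5"], "section"),
]

grouped = group_tags_by_class(pairs)
for classes, tags in grouped.items():
    print(" ".join(classes), sorted(tags))
```

With a set per key, a class that appears on both a `div` and a `section` keeps both tag names rather than whichever came last.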
Upvotes: 3
Reputation: 17637
I think regex is the wrong tool for the job here. Consider loading your HTML into a DOM document and querying it with DOM selectors instead.
The following example is JavaScript, because that lets me include it as a runnable snippet, but it should explain the approach well enough for you to write the Python equivalent.
const classElements = document.querySelectorAll("[class]");
for (let i = 0; i < classElements.length; i++)
{
console.log(classElements[i].className + ": " + classElements[i].tagName);
}
<div class="header_container container_12">
<div class="grid_5">
<h1><a href="#">Logo Text Here</a></h1>
</div>
<div class="grid_7">
<div class="menu_items">
<a href="#" class="home active">Home</a><a href="#" class="portfolio">Portfolio</a>
<a href="#"
class="about">About Me
</a><a href="#" class="contact">Contact Me</a>
</div>
</div>
</div>
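For reference, here is a rough Python counterpart that uses only the standard library's `html.parser`. It is not a DOM query: it streams over start tags instead of building a tree, which should be enough for collecting class/tag pairs. The names `ClassCollector` and `class_to_tag` are made up for this sketch, and the HTML is a shortened version of the sample.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Record (class attribute, tag name) for every start tag with a class."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and names are lowercased.
        class_value = dict(attrs).get("class")
        if class_value:
            self.found.append((class_value, tag))


html = """<div class="grid_7">
<div class="menu_items">
<a href="#" class="home active">Home</a><a href="#"
class="about">About Me
</a>
</div>
</div>"""

collector = ClassCollector()
collector.feed(html)
collector.close()

# Later duplicates of the same class string would overwrite earlier ones.
class_to_tag = dict(collector.found)
for class_value, tag in class_to_tag.items():
    print(f"{class_value}: {tag}")
```

Note that the parser reports tag names in lowercase, whereas `tagName` in the browser returns them in uppercase for HTML documents.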
Upvotes: 0