Reputation: 1086
Given something like the following HTML:
<div>
  <div>
    <meta ... />
    <img />
  </div>
  <div id="main">
    <p class="foo">Hello, World</p>
    <div>
      <div class="bar">Hey, there!</div>
    </div>
  </div>
</div>
How would I go about selecting only the elements that contain text and outputting a generated, unique CSS selector for each of them?
For this example, that would be:
[
  { "html": "Hello, World", "selector": ".foo" },  # can be even more specific if there are other .foo's
  { "html": "Hey, there!", "selector": ".bar" }
]
I was playing with BeautifulSoup and html_sanitizer but wasn't getting great results.
Upvotes: 1
Views: 492
Reputation: 1772
This should be a piece of cake with BeautifulSoup:
from bs4 import BeautifulSoup

html = """
<div>
  <div>
    <meta ... />
    <img />
  </div>
  <div id="main">
    <p class="foo">Hello, World</p>
    <div>
      <div class="bar">Hey, there!</div>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
results = []

# Visit every text node in the document
for element in soup.find_all(string=True):
    text = element.strip()
    if not text:
        continue  # skip whitespace-only text nodes

    # Climb to the nearest ancestor that has an id or a class
    parent = element.parent
    while parent is not None and not (parent.has_attr('id') or parent.has_attr('class')):
        parent = parent.parent
    if parent is None:
        continue  # no ancestor with an id or class, nothing to select on

    if parent.has_attr('id'):
        selector = '#' + parent['id']
    else:
        # BeautifulSoup returns class as a list; join it into one selector such as ".a.b"
        selector = ''.join('.' + cls for cls in parent['class'])

    results.append({"html": text, "selector": selector})

print(results)
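One caveat: the loop above takes the nearest id or class, so the selector is not guaranteed to be unique if, say, two elements share the class foo. Below is a rough sketch of one way to tighten that, not a definitive implementation. The helper name unique_selector and the sample markup with a repeated .foo are my own for illustration. It assumes bs4 4.7 or newer (so that soup.select() is backed by Soup Sieve and understands :nth-of-type) and that ids and class names are plain identifiers that need no CSS escaping.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def unique_selector(tag, soup):
    """Return a selector that matches only `tag` in `soup`, or None if none is found.

    Hypothetical helper: walks up from `tag`, adding one ancestor per step,
    and returns as soon as soup.select() reports exactly one match.
    """
    parts = []  # selector parts for the levels below `node`, outermost first
    node = tag
    while isinstance(node, Tag) and node.parent is not None:
        part = node.name
        if node.has_attr('id'):
            part = '#' + node['id']
        elif node.has_attr('class'):
            part += ''.join('.' + cls for cls in node['class'] if cls)

        # 1-based position among siblings with the same tag name
        index = 1 + sum(
            1 for sib in node.previous_siblings
            if isinstance(sib, Tag) and sib.name == node.name
        )
        pinned = f'{part}:nth-of-type({index})'

        # Try the plain part first, then the position-pinned variant
        for candidate in (part, pinned):
            selector = ' > '.join([candidate] + parts)
            if len(soup.select(selector)) == 1:
                return selector

        parts.insert(0, pinned)
        node = node.parent
    return None  # could not prove uniqueness; caller decides what to do


# Sample markup (an assumption for this sketch): two elements share .foo
html = """
<div id="main">
  <p class="foo">Hello, World</p>
  <div><div class="bar">Hey, there!</div></div>
  <p class="foo">Hello again</p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

results = []
for text in soup.find_all(string=True):
    if text.strip():
        results.append({
            "html": text.strip(),
            "selector": unique_selector(text.parent, soup),
        })
print(results)
```

Every selector it returns has been checked against the parsed document, so it is unique for that document only. If the page changes, regenerate the selectors. The None fallback covers unusual nestings where even the full path with positions still matches more than one element.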
Upvotes: 1