Jack

Reputation: 1086

Python: Extract text and element selectors from html elements

Given something like the following html:

<div>
     <div>
         <meta ... />
         <img />
     </div>
     <div id="main">
        <p class="foo">Hello, World</p>
        <div>
           <div class="bar">Hey, there!</div>
        </div>
     </div>
</div>

How would I go about selecting only the elements that contain text and outputting a generated, unique CSS selector for each such element?

For this example, that would be:

[
  { "html": "Hello, World", "selector": ".foo" },  # can be even more specific if there are other .foo's
  { "html": "Hey, there!", "selector": ".bar" }
]

I was playing with BeautifulSoup and html_sanitizer but wasn't getting great results.

Upvotes: 1

Views: 492

Answers (1)

DVN-Anakin

Reputation: 1772

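This should be a piece of cake with BeautifulSoup. For each non-empty text node, walk up to the nearest ancestor that has an id or a class, and build the selector from that: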

from bs4 import BeautifulSoup

html = """
<div>
     <div>
         <meta ... />
         <img />
     </div>
     <div id="main">
        <p class="foo">Hello, World</p>
        <div>
           <div class="bar">Hey, there!</div>
        </div>
     </div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

results = []

for element in soup.find_all(string=True):
    text = element.strip()
    if not text:
        continue  # skip whitespace-only text nodes between tags

    # Walk up to the nearest ancestor that has an id or a class.
    parent = element.parent
    while parent and not (parent.has_attr('id') or parent.has_attr('class')):
        parent = parent.parent
    if parent is None:
        continue  # no ancestor with an id or class; nothing to select on

    if parent.has_attr('id'):
        selector = '#' + parent['id']
    else:
        # BeautifulSoup returns 'class' as a list; join it into one selector string.
        selector = ''.join('.' + cls for cls in parent['class'])

    results.append({"html": text, "selector": selector})

print(results)
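
Note that the nearest id or class is not guaranteed to be unique (the question mentions the case of several .foo's). A rough sketch of one way to handle that, not a definitive implementation: build the selector from the element upward and stop as soon as soup.select() matches exactly one element. The helper name unique_selector is made up for illustration, and it assumes ids and class names are plain identifiers that need no CSS escaping.

```python
from bs4 import BeautifulSoup

html = """
<div>
     <div>
         <meta ... />
         <img />
     </div>
     <div id="main">
        <p class="foo">Hello, World</p>
        <div>
           <div class="bar">Hey, there!</div>
        </div>
     </div>
</div>
"""


def unique_selector(tag, soup):
    """Hypothetical helper: return a CSS selector matching only `tag`, if one is found."""
    parts = []
    node = tag
    # Stop at the BeautifulSoup root object, whose name is '[document]'.
    while node is not None and node.name != '[document]':
        if node.has_attr('id'):
            part = '#' + node['id']
        else:
            part = node.name + ''.join('.' + c for c in node.get('class', []))
            previous = node.find_previous_siblings(node.name)
            following = node.find_next_siblings(node.name)
            if previous or following:
                # Disambiguate among same-named siblings by position.
                part += f':nth-of-type({len(previous) + 1})'
        parts.insert(0, part)
        selector = ' > '.join(parts)
        if len(soup.select(selector)) == 1:
            return selector
        node = node.parent
    # Fallback: the full path from the top-level element.
    return ' > '.join(parts)


soup = BeautifulSoup(html, 'html.parser')

results = []
for text_node in soup.find_all(string=True):
    text = text_node.strip()
    if text:
        results.append({
            "html": text,
            "selector": unique_selector(text_node.parent, soup),
        })

print(results)
```

For the HTML in the question this gives p.foo and div.bar. If a second p with class foo were added next to the first, the selectors would gain an :nth-of-type(...) suffix to tell them apart.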

Upvotes: 1
