David J.

Reputation: 1913

Get contents of HTML elements with arbitrary degrees of nesting (along with XPath of contents)

I'm looking for a way to read the text contents (i.e. no HTML code) of HTML elements with arbitrary degrees of nesting.

If there were no nesting it would be easy enough, but since HTML isn't a regular language, others with the same problem have been told to use (X)HTML parsers.

Is it possible to do this with Beautiful Soup? Something like:

page = soup.find('*').getText()  # obviously this won't give xpath info

I can imagine using a generator to feed different tag names into the find function, but I won't know what the tag names are. I would also need to return something like an xpath reference to the element with the text so that I would know the source of the contents that are eventually returned from the find function.

So, for the following HTML:

<div>
  text of div 1
  <span>
     text of span 1
     <span>
       text of span 2
     </span>
  </span>
</div>

I would need a function to return something like:

('text of div 1', '/div'), ('text of span 1', '/div/span'), ('text of span 2', '/div/span/span')

Upvotes: 3

Views: 92

Answers (2)

Keyur Potdar

Reputation: 7248

I've written a recursive function that returns the XPaths of all the text nodes inside a tag, as a dictionary with the following format:

{'xpath1': {'text': 'text1'}, 'xpath2': {'text': 'text2'}, ...}

Code:

from bs4 import BeautifulSoup, NavigableString

def get_xpaths_dict(soup, xpaths=None, curr_path=''):
    # use None instead of a mutable default so repeated calls don't share state
    if xpaths is None:
        xpaths = {}
    curr_path += '/{}'.format(soup.name)
    for item in soup.contents:
        if isinstance(item, NavigableString):
            # keep only non-whitespace text nodes
            if item.strip():
                try:
                    # this path already holds a text (a sibling tag with the same
                    # name, or an earlier text node), so bump the count and store
                    # this text under an indexed path such as '/div/span/span[2]'
                    xpaths[curr_path]['count'] += 1
                    count = xpaths[curr_path]['count']
                    curr_path += '[{}]'.format(count)
                    xpaths[curr_path] = {'text': item.strip()}
                except KeyError:
                    # first text seen at this path
                    xpaths[curr_path] = {'text': item.strip(), 'count': 1}
        else:
            # item is a child tag: recurse with the current path as prefix
            xpaths = get_xpaths_dict(item, xpaths, curr_path)
    return xpaths

html = '''<div>
  text of div 1
  <span>
     text of span 1.1
     <span>
       text of span 2.1
     </span>
     <span>
       text of span 2.2
       <span>
         text of span 3
       </span>
     </span>
  </span>
</div>'''
soup = BeautifulSoup(html, 'html.parser')

xpaths = get_xpaths_dict(soup.div)
print(xpaths)

Output:

{'/div': {'text': 'text of div 1', 'count': 1}, '/div/span': {'text': 'text of span 1.1', 'count': 1}, '/div/span/span': {'text': 'text of span 2.1', 'count': 2}, '/div/span/span[2]': {'text': 'text of span 2.2'}, '/div/span/span[2]/span': {'text': 'text of span 3', 'count': 1}}

I know this is not the format you were expecting the output in, but you can convert it into any format you want. For example, to get your expected output, simply do the following:

expected_output = [(v['text'], k) for k, v in xpaths.items()]
print(expected_output)

Output:

[('text of div 1', '/div'), ('text of span 1.1', '/div/span'), ('text of span 2.1', '/div/span/span'), ('text of span 2.2', '/div/span/span[2]'), ('text of span 3', '/div/span/span[2]/span')]

Some explanation:

The extra count key stores how many texts have already been found at a given path, so that repeated paths get indexed keys like /div/span/span[2]. Keeping everything in a dictionary keyed by path also keeps the function efficient: each tag is visited exactly once.
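For reference, running the same two steps on the exact HTML from the question should produce precisely the tuples asked for. A quick check (assuming Python 3.7+, where dictionaries preserve insertion order; the explicit empty dict is just a safeguard so results from other calls don't leak in):

question_html = '''<div>
  text of div 1
  <span>
     text of span 1
     <span>
       text of span 2
     </span>
  </span>
</div>'''
question_soup = BeautifulSoup(question_html, 'html.parser')
question_xpaths = get_xpaths_dict(question_soup.div, xpaths={})  # fresh dict for this call
print([(v['text'], k) for k, v in question_xpaths.items()])
# [('text of div 1', '/div'), ('text of span 1', '/div/span'),
#  ('text of span 2', '/div/span/span')]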

Bonus:

As the function returns a dictionary with XPaths as keys, you can look up any text by its XPath. For example:

xpaths = get_xpaths_dict(soup.div)
print(xpaths['/div/span/span[2]/span']['text'])
# text of span 3

Upvotes: 2

Karl Anka

Reputation: 2869

What about this:

# soup is the BeautifulSoup object built from the HTML in the question
result_set = []

for tag in soup.find_all():              # every tag, at every level of nesting
    parent_list = []
    content_of_tag = tag.find(text=True)  # first text node inside this tag

    parent_list.append(tag.name)

    # walk up through the ancestors, collecting their names
    while tag.parent is not None:
        tag = tag.parent
        parent_list.append(tag.name)

    result_set.append((content_of_tag, parent_list))

The first find_all() call finds all tags of every type at every level of nesting. Iterating over these, tag.find(text=True) finds the first piece of text inside each tag. parent_list.append(tag.name) before the loop adds the current tag's name to the parent list, and the while loop then walks up through all of the tag's parents, adding their names to the parent list as well.
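To get tuples in the format the question asks for, the parent list can be reversed and joined into a path string. A minimal sketch building on the result_set above (the '[document]' name is what BeautifulSoup uses for its root object; note this simple join does not add positional indices such as [2] for repeated sibling tags):

paths = []
for text, parents in result_set:
    # reverse so the outermost ancestor comes first, and drop the
    # synthetic '[document]' root that BeautifulSoup adds
    names = [name for name in reversed(parents) if name != '[document]']
    paths.append((text.strip() if text else '', '/' + '/'.join(names)))

print(paths)
# For the HTML in the question this prints:
# [('text of div 1', '/div'), ('text of span 1', '/div/span'),
#  ('text of span 2', '/div/span/span')]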

Upvotes: 2
