David J.

Reputation: 1913

Get contents of HTML elements with arbitrary degrees of nesting (along with XPath of contents)

I'm looking for a way to read the text contents (i.e. no HTML code) of HTML elements with arbitrary degrees of nesting.

If there were no nesting it would be easy enough, but since HTML isn't a regular language, others with the same problem have been told to use (X)HTML parsers.

Is it possible to do this with Beautiful Soup? Something like:

page = soup.find('*').getText()  # obviously this won't give xpath info

I can imagine using a generator to feed different tag names into the find function, but I won't know what the tag names are. I would also need to return something like an xpath reference to the element with the text so that I would know the source of the contents that are eventually returned from the find function.

So, for the following HTML:

<div>
  text of div 1
  <span>
     text of span 1
     <span>
       text of span 2
     </span>
  </span>
</div>

I would need a function to return something like:

('text of div 1', '/div'), ('text of span 1', '/div/span'), ('text of span 2', '/div/span/span')

Upvotes: 3

Views: 92

Answers (2)

Keyur Potdar

Reputation: 7248

I've written a recursive function that returns the XPaths of all the text nodes inside a tag, as a dictionary with the following format:

{'xpath1': {'text': 'text1'}, 'xpath2': {'text': 'text2'}, ...}

Code:

from bs4 import BeautifulSoup, NavigableString

def get_xpaths_dict(soup, xpaths=None, curr_path=''):
    # use None instead of a mutable default so repeated calls don't share state
    if xpaths is None:
        xpaths = {}
    curr_path += '/{}'.format(soup.name)
    for item in soup.contents:
        if isinstance(item, NavigableString):
            # keep only non-whitespace text nodes
            if item.strip():
                try:
                    # this path already holds a text (a sibling tag with the same
                    # name, or an earlier text node), so bump the count and store
                    # this text under an indexed path such as '/div/span/span[2]'
                    xpaths[curr_path]['count'] += 1
                    count = xpaths[curr_path]['count']
                    curr_path += '[{}]'.format(count)
                    xpaths[curr_path] = {'text': item.strip()}
                except KeyError:
                    # first text seen at this path
                    xpaths[curr_path] = {'text': item.strip(), 'count': 1}
        else:
            # item is a child tag: recurse with the current path as prefix
            xpaths = get_xpaths_dict(item, xpaths, curr_path)
    return xpaths

html = '''<div>
  text of div 1
  <span>
     text of span 1.1
     <span>
       text of span 2.1
     </span>
     <span>
       text of span 2.2
       <span>
         text of span 3
       </span>
     </span>
  </span>
</div>'''
soup = BeautifulSoup(html, 'html.parser')

xpaths = get_xpaths_dict(soup.div)
print(xpaths)

Output:

{'/div': {'text': 'text of div 1', 'count': 1}, '/div/span': {'text': 'text of span 1.1', 'count': 1}, '/div/span/span': {'text': 'text of span 2.1', 'count': 2}, '/div/span/span[2]': {'text': 'text of span 2.2'}, '/div/span/span[2]/span': {'text': 'text of span 3', 'count': 1}}

I know this is not the format you were expecting the output in, but you can convert it into any format you want. For example, to get your expected output, simply do the following:

expected_output = [(v['text'], k) for k, v in xpaths.items()]
print(expected_output)

Output:

[('text of div 1', '/div'), ('text of span 1.1', '/div/span'), ('text of span 2.1', '/div/span/span'), ('text of span 2.2', '/div/span/span[2]'), ('text of span 3', '/div/span/span[2]/span')]

Some explanation:

The extra count key stores how many texts have already been found at a given path, so that repeated paths get indexed keys like /div/span/span[2]. Keeping everything in a dictionary keyed by path also keeps the function efficient: each tag is visited exactly once.
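For reference, running the same two steps on the exact HTML from the question should produce precisely the tuples asked for. A quick check (assuming Python 3.7+, where dictionaries preserve insertion order; the explicit empty dict is just a safeguard so results from other calls don't leak in):

question_html = '''<div>
  text of div 1
  <span>
     text of span 1
     <span>
       text of span 2
     </span>
  </span>
</div>'''
question_soup = BeautifulSoup(question_html, 'html.parser')
question_xpaths = get_xpaths_dict(question_soup.div, xpaths={})  # fresh dict for this call
print([(v['text'], k) for k, v in question_xpaths.items()])
# [('text of div 1', '/div'), ('text of span 1', '/div/span'),
#  ('text of span 2', '/div/span/span')]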

Bonus:

As the function returns a dictionary with XPaths as keys, you can look up any text by its XPath. For example:

xpaths = get_xpaths_dict(soup.div)
print(xpaths['/div/span/span[2]/span']['text'])
# text of span 3

Upvotes: 2

Karl Anka

Reputation: 2869

What about this:

# soup is the BeautifulSoup object built from the HTML in the question
result_set = []

for tag in soup.find_all():              # every tag, at every level of nesting
    parent_list = []
    content_of_tag = tag.find(text=True)  # first text node inside this tag

    parent_list.append(tag.name)

    # walk up through the ancestors, collecting their names
    while tag.parent is not None:
        tag = tag.parent
        parent_list.append(tag.name)

    result_set.append((content_of_tag, parent_list))

The first find_all() call finds all tags of every type at every level of nesting. Iterating over these, tag.find(text=True) finds the first piece of text inside each tag. parent_list.append(tag.name) before the loop adds the current tag's name to the parent list, and the while loop then walks up through all of the tag's parents, adding their names to the parent list as well.
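To get tuples in the format the question asks for, the parent list can be reversed and joined into a path string. A minimal sketch building on the result_set above (the '[document]' name is what BeautifulSoup uses for its root object; note this simple join does not add positional indices such as [2] for repeated sibling tags):

paths = []
for text, parents in result_set:
    # reverse so the outermost ancestor comes first, and drop the
    # synthetic '[document]' root that BeautifulSoup adds
    names = [name for name in reversed(parents) if name != '[document]']
    paths.append((text.strip() if text else '', '/' + '/'.join(names)))

print(paths)
# For the HTML in the question this prints:
# [('text of div 1', '/div'), ('text of span 1', '/div/span'),
#  ('text of span 2', '/div/span/span')]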

Upvotes: 2
