Reputation: 1913
I'm looking for a way to read the text contents (i.e. no HTML code) of HTML elements with arbitrary degrees of nesting.
If there were no nesting, it would be easy enough, but since HTML isn't a regular language, others with the same problem have been told to use (X)HTML parsers.
Is it possible to do this with beautiful soup? Something like:
page = soup.find('*').getText() # obviously this won't give xpath info
I can imagine using a generator to feed different tag names into the find function, but I won't know what the tag names are. I would also need to return something like an XPath reference to the element containing the text, so that I would know the source of the contents that are eventually returned from find.
So, for the following HTML:
<div>
    text of div 1
    <span>
        text of span 1
        <span>
            text of span 2
        </span>
    </span>
</div>
I would need a function to return something like:
('text of div 1', '/div'), ('text of span 1', '/div/span'), ('text of span 2', '/div/span/span')
Upvotes: 3
Views: 92
Reputation: 7248
I've written a recursive function that'll return the XPaths of all the texts in the tag, as a dictionary in the following format:
{'xpath1': {'text': 'text1'}, 'xpath2': {'text': 'text2'}, ...}
Code:
from bs4 import BeautifulSoup, NavigableString

def get_xpaths_dict(soup, xpaths=None, curr_path=''):
    # default to None rather than a mutable {}, so repeated calls
    # don't share (and pollute) the same dictionary
    if xpaths is None:
        xpaths = {}
    curr_path += '/{}'.format(soup.name)
    for item in soup.contents:
        if isinstance(item, NavigableString):
            if item.strip():  # ignore whitespace-only strings
                try:
                    # path seen before: bump the sibling count and
                    # store this text under an indexed path instead
                    xpaths[curr_path]['count'] += 1
                    count = xpaths[curr_path]['count']
                    curr_path += '[{}]'.format(count)
                    xpaths[curr_path] = {'text': item.strip()}
                except KeyError:
                    # first occurrence of this path
                    xpaths[curr_path] = {'text': item.strip(), 'count': 1}
        else:
            xpaths = get_xpaths_dict(item, xpaths, curr_path)
    return xpaths
html = '''<div>
    text of div 1
    <span>
        text of span 1.1
        <span>
            text of span 2.1
        </span>
        <span>
            text of span 2.2
            <span>
                text of span 3
            </span>
        </span>
    </span>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
xpaths = get_xpaths_dict(soup.div)
print(xpaths)
Output:
{'/div': {'text': 'text of div 1', 'count': 1}, '/div/span': {'text': 'text of span 1.1', 'count': 1}, '/div/span/span': {'text': 'text of span 2.1', 'count': 2}, '/div/span/span[2]': {'text': 'text of span 2.2'}, '/div/span/span[2]/span': {'text': 'text of span 3', 'count': 1}}
I know this is not the format you were expecting the output in, but you can convert it into any format you want. For example, to get your expected output, simply do the following:
expected_output = [(v['text'], k) for k, v in xpaths.items()]
print(expected_output)
Output:
[('text of div 1', '/div'), ('text of span 1.1', '/div/span'), ('text of span 2.1', '/div/span/span'), ('text of span 2.2', '/div/span/span[2]'), ('text of span 3', '/div/span/span[2]/span')]
Some explanation:
The extra count key in the dictionary stores how many same-named sibling tags have been seen at the current path, which is what produces the [2], [3], ... indices. Using a dictionary keyed by path makes each lookup constant-time, so every tag is visited only once.
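For instance, here is a hypothetical fragment (the names html2 and soup2 are mine, just for illustration) with three sibling p tags; the second and third occurrences get an index appended to the path, while the div itself gets no entry because it has no direct text:

html2 = '<div><p>one</p><p>two</p><p>three</p></div>'
soup2 = BeautifulSoup(html2, 'html.parser')
print(get_xpaths_dict(soup2.div))
# {'/div/p': {'text': 'one', 'count': 3},
#  '/div/p[2]': {'text': 'two'},
#  '/div/p[3]': {'text': 'three'}}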
Bonus:
As the function returns a dictionary with XPaths as the keys, you can get any text using its XPath. For example:
xpaths = get_xpaths_dict(soup.div)
print(xpaths['/div/span/span[2]/span']['text'])
# text of span 3
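A caveat if you want to evaluate these paths with a real XPath engine: the indexed keys are valid XPath, but the unindexed ones (e.g. /div/span/span) select all same-named siblings, not just the first. As a sketch, assuming you also have lxml installed (and noting that this works here only because the snippet happens to be well-formed XML), you can verify an indexed path like so:

from lxml import etree

tree = etree.fromstring(html)  # the html string from above
print(tree.xpath('/div/span/span[2]/span/text()')[0].strip())
# text of span 3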
Upvotes: 2
Reputation: 2869
What about this:
# assumes soup is the BeautifulSoup object parsed from your HTML
result_set = []
for tag in soup.find_all():                # every tag, at every nesting level
    parent_list = []
    content_of_tag = tag.find(text=True)   # first text node inside this tag
    parent_list.append(tag.name)
    while tag.parent is not None:          # walk up to the document root
        tag = tag.parent
        parent_list.append(tag.name)
    result_set.append((content_of_tag, parent_list))
find_all() with no arguments finds all tags of all types on all levels. Iterating over these, tag.find(text=True) finds the first text in each of those tags (newer BeautifulSoup versions spell this string=True). parent_list.append(tag.name) before the loop adds the current tag's name to the parent list, and the while loop then walks through all of the tag's parents, adding their names as well. Note that parent_list therefore runs from the innermost tag outward and ends with BeautifulSoup's '[document]' root name; see the sketch below for turning it into the path format you asked for.
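A minimal sketch of that post-processing (unlike the other answer, it doesn't number same-named siblings, so two sibling spans would map to the same path):

pairs = []
for text, parents in result_set:
    if text is None or not text.strip():
        continue  # skip tags whose first text node is missing or whitespace-only
    # drop the '[document]' root and reverse to outermost-first order
    names = [name for name in reversed(parents) if name != '[document]']
    pairs.append((text.strip(), '/' + '/'.join(names)))

print(pairs)
# [('text of div 1', '/div'), ('text of span 1', '/div/span'),
#  ('text of span 2', '/div/span/span')]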
Upvotes: 2