Reputation: 20336
I'm trying to parse some HTML with BeautifulSoup, and I'd like to get all the text (recursively) in a tag, but I want to ignore all text that appears within a <small> tag. For example, this HTML:
<li>
<a href="/path">
Final
</a>
definition.
<small>
Fun fact.
</small>
</li>
should give the text Final definition.
Note that this is a minimal example. In the real HTML, there are many other tags involved, so <small> should be excluded rather than <a> being included.
The text attribute of the tag is close to what I want, but it would include Fun fact.
I could concatenate the text of all children except the <small> tags, but that would leave out definition.
I couldn't find a method like get_text_until (the <small> tag is always at the end), so what can I do?
Upvotes: 0
Views: 145
Reputation: 19615
You can use find_all to find all the <small> tags, clear them, then use get_text():
>>> soup
<li>
<a href="/path">
Final
</a>
definition.
<small>
Fun fact.
</small>
</li>
>>> for el in soup.find_all("small"):
...     el.clear()
...
>>> soup
<li>
<a href="/path">
Final
</a>
definition.
<small></small>
</li>
>>> soup.get_text()
'\n\n\n Final\n \n definition.\n \n\n'
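The same idea can be written as a standalone script. This is only a sketch: it assumes the bs4 package is installed, uses the built-in "html.parser", and swaps clear() for decompose(), which removes the <small> tag itself rather than just emptying it. Passing a separator and strip=True to get_text() collapses the leftover newlines. Note that it modifies the parsed tree in place.

```python
from bs4 import BeautifulSoup

# The example markup from the question.
HTML = """<li>
<a href="/path">
Final
</a>
definition.
<small>
Fun fact.
</small>
</li>"""

soup = BeautifulSoup(HTML, "html.parser")
li = soup.li

# Remove every <small> element (and its contents) from the tree.
for el in li.find_all("small"):
    el.decompose()

# Join the remaining text fragments with a single space,
# stripping surrounding whitespace and dropping empty fragments.
text = li.get_text(" ", strip=True)
print(text)
```

If the original tree is still needed afterwards, parse a second copy first, since decompose() cannot be undone.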
Upvotes: 1
Reputation: 441
You can do this by passing recursive=False, which tells BeautifulSoup not to recurse into child tags. For example:
soup.li.find(text=True, recursive=False)
So you can do something like:
' '.join(li.find(text=True, recursive=False) for li in soup.findAll('li', 'a'))
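To see what recursive=False returns on the question's markup, here is a small sketch (assuming bs4 is installed and using "html.parser"). On its own it only collects strings that are direct children of the <li>, so the text inside <a> is missed. The helper text_skipping is a hypothetical name, not part of BeautifulSoup: it walks the children itself, skips any <small> tag, and recurses into every other tag, leaving the tree unmodified.

```python
from bs4 import BeautifulSoup, NavigableString

# The example markup from the question.
HTML = """<li>
<a href="/path">
Final
</a>
definition.
<small>
Fun fact.
</small>
</li>"""

soup = BeautifulSoup(HTML, "html.parser")
li = soup.li

# Strings that are direct children of <li>; nested tags are not visited.
direct = [s.strip() for s in li.find_all(string=True, recursive=False) if s.strip()]


def text_skipping(tag, excluded="small"):
    """Hypothetical helper: collect text under `tag`, skipping `excluded` tags."""
    parts = []
    for child in tag.children:
        if isinstance(child, NavigableString):
            text = child.strip()
        elif child.name == excluded:
            continue  # ignore the excluded tag and everything inside it
        else:
            text = text_skipping(child, excluded)  # recurse into other tags
        if text:
            parts.append(text)
    return " ".join(parts)


print(direct)
print(text_skipping(li))
```

One caveat: comments in the markup are also string nodes in BeautifulSoup, so real pages may need an extra check to leave them out.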
Upvotes: 1