zondo

Reputation: 20336

Get all text in a tag unless it is in another tag

I'm trying to parse some HTML with BeautifulSoup, and I'd like to get all the text (recursively) in a tag, but I want to ignore all text that appears within a small tag. For example, this HTML:

<li>
  <a href="/path">
    Final
  </a>
  definition.
  <small>
    Fun fact.
  </small>
</li>

should give the text Final definition. Note that this is a minimal example. In the real HTML, there are many other tags involved, so the small tag should be excluded, rather than the a tag being explicitly included.

The text attribute of the tag is close to what I want, but it would include Fun fact. I could concatenate the text of all children except the small tags, but that would leave out definition. I couldn't find a method like get_text_until (the small tag is always at the end), so what can I do?

Upvotes: 0

Views: 145

Answers (2)

Wander Nauta

Reputation: 19615

You can use find_all to find all the <small> tags, clear them, then use get_text():

>>> soup

<li>
<a href="/path">
    Final
  </a>
  definition.
  <small>
    Fun fact.
  </small>
</li>

>>> for el in soup.find_all("small"):
...     el.clear()
...
>>> soup

<li>
<a href="/path">
    Final
  </a>
  definition.
  <small></small>
</li>

>>> soup.get_text()
'\n\n\n    Final\n  \n  definition.\n  \n\n'
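As a self-contained variation on the session above, here is a minimal sketch that wraps the same idea in a function. The helper name text_without and the HTML constant are made up for illustration. It assumes the built-in html.parser backend, uses decompose() instead of clear() so the empty <small></small> shell is removed as well, and passes a separator and strip=True to get_text() to collapse the surrounding whitespace:

```python
from bs4 import BeautifulSoup

# Sample markup from the question.
HTML = """
<li>
  <a href="/path">
    Final
  </a>
  definition.
  <small>
    Fun fact.
  </small>
</li>
"""


def text_without(html, tag_name="small"):
    """Return the text of `html`, skipping everything inside `tag_name` tags.

    Hypothetical helper: it parses its own copy of the markup, so the
    caller's soup (if any) is not modified.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Remove each unwanted tag, including its contents, from the tree.
    for el in soup.find_all(tag_name):
        el.decompose()
    # Strip each remaining text node, drop the empty ones,
    # and join what is left with single spaces.
    return soup.get_text(" ", strip=True)


print(text_without(HTML))  # Final definition.
```

Note that this modifies the parsed tree, so parse the document again (or work on a copy) if you still need the <small> contents afterwards.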

Upvotes: 1

Hemel

Reputation: 441

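You can pass recursive=False to tell find not to recurse into child tags, so that it only matches text nodes that are direct children of the tag (find returns just the first such node). For example: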

soup.li.find(text=True, recursive=False)

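To apply this to both the li and a tags, pass the tag names as a list (a second positional string would be treated as a CSS class filter, not a tag name):

' '.join(tag.find(text=True, recursive=False) for tag in soup.findAll(['li', 'a']))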
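Because find only returns the first direct text node of each tag, looking at direct children alone may miss text such as "definition." in the question's example. A possible alternative, sketched here under the assumption that the built-in html.parser backend is used, is to walk every text node under the tag and skip those that have a small ancestor. The helper name text_outside is hypothetical, and the tree is left unmodified:

```python
from bs4 import BeautifulSoup

# Sample markup from the question.
HTML = """
<li>
  <a href="/path">
    Final
  </a>
  definition.
  <small>
    Fun fact.
  </small>
</li>
"""


def text_outside(tag, excluded="small"):
    """Join all text under `tag` that is not nested inside an `excluded` tag.

    Hypothetical helper, not part of BeautifulSoup itself.
    """
    parts = []
    # string=True matches every text node, at any depth.
    for s in tag.find_all(string=True):
        # Keep the node only if none of its ancestors is an excluded tag.
        if s.find_parent(excluded) is None:
            stripped = s.strip()
            if stripped:  # ignore whitespace-only nodes
                parts.append(stripped)
    return " ".join(parts)


soup = BeautifulSoup(HTML, "html.parser")
print(text_outside(soup.li))  # Final definition.
```

Since nothing is removed from the soup, the <small> contents remain available for later use.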

Upvotes: 1
