JansthcirlU

Reputation: 737

How to scrape nested text between tags using BeautifulSoup?

I found a website that uses the following HTML structure:

...
<td>
  <span>some span text</span>
  some td text
</td>
...

I'm interested in retrieving "some td text" but not "some span text". However, the get_text() method seems to return all of the text, as "some span textsome td text". Is there a way to get only the text directly inside a given element using BeautifulSoup?

Not all the tds follow the same structure, so unfortunately I cannot predict the structure of the resulting string to trim it where necessary.

Upvotes: 1

Views: 344

Answers (1)

costaparas

Reputation: 5237

Each element has a name attribute, which tells you the type of tag, e.g. div, td, span. For bare text that has no tag of its own (a text node), name is None.
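
To make this concrete, here is a minimal sketch that lists the name of each direct child of the td from the question. The markup is condensed onto one line purely for illustration, so that no whitespace-only text nodes appear.

```python
from bs4 import BeautifulSoup

# Condensed version of the question's markup (an assumption made for brevity).
html = '<td><span>some span text</span>some td text</td>'
td = BeautifulSoup(html, 'html.parser').find('td')

# Pair each direct child with its .name: the tag name for tags, None for bare text.
names = [(child.name, str(child)) for child in td.children]
print(names)
# [('span', '<span>some span text</span>'), (None, 'some td text')]
```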

So you can just use a simple list comprehension to filter out all the tag elements.

from bs4 import BeautifulSoup

html = '''
<td>
  <span>some span text</span>
  some td text
</td>
'''

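soup = BeautifulSoup(html, 'html.parser')
content = soup.find('td')
# Iterating over a tag yields its direct children. Keep only the bare text
# nodes (name is None), strip whitespace, and drop the ones left empty.
text = [c.strip() for c in content if c.name is None and c.strip() != '']
print(text)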

Because the comprehension strips surrounding whitespace and drops empty strings, this will print:

['some td text']

If you wanted to join up the content afterwards, you could use join:

print('\n'.join(text))
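
Since the question mentions that not all tds share the same structure, the same idea can be wrapped in a small helper and applied to every cell. The sketch below is one possible variant, not the only way to do it: own_text is a hypothetical helper name, the sample table is made up for illustration, and it uses find_all(string=True, recursive=False), which returns only the text nodes that are direct children of the tag (older BeautifulSoup versions spell the argument text=True).

```python
from bs4 import BeautifulSoup

def own_text(tag):
    """Return the text sitting directly inside `tag`, ignoring nested tags."""
    # string=True matches text nodes; recursive=False limits the search
    # to direct children, so text inside <span> and friends is skipped.
    pieces = (s.strip() for s in tag.find_all(string=True, recursive=False))
    return ' '.join(p for p in pieces if p)

# Made-up sample with three differently structured cells.
html = '''
<table><tr>
  <td><span>some span text</span> some td text</td>
  <td>plain td text</td>
  <td><span>only span text</span></td>
</tr></table>
'''

soup = BeautifulSoup(html, 'html.parser')
results = [own_text(td) for td in soup.find_all('td')]
print(results)
# ['some td text', 'plain td text', '']
```

A cell that contains only nested tags yields an empty string, so the caller can decide whether to skip it.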

Upvotes: 2
