Reputation: 737
I found a website using the following HTML structure somewhere:
...
<td>
<span>some span text</span>
some td text
</td>
...
I'm interested in retrieving the "some td text" and not the "some span text", but the get_text() method seems to return all the text as "some span textsome td text". Is there a way to get just the text inside a certain element using BeautifulSoup?
Not all the tds follow the same structure, so unfortunately I cannot predict the structure of the resulting string to trim it where necessary.
Upvotes: 1
Views: 344
Reputation: 5237
Each element has a name attribute, which tells you the type of tag, e.g. div, td, span. If there is no tag (bare content), it will be None.
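To see the name attribute in action, here is a small sketch run on the question's snippet. The variable names td and child_names are illustrative and not part of the original answer.

```python
from bs4 import BeautifulSoup

html = "<td>\n<span>some span text</span>\nsome td text\n</td>"
td = BeautifulSoup(html, "html.parser").find("td")

# Tags report their tag name; bare strings (including whitespace-only ones) report None.
child_names = [child.name for child in td.children]
print(child_names)  # [None, 'span', None]
```

The first None is the newline right after the opening td tag, and the last one is the string that contains "some td text".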
So you can just use a simple list comprehension to filter out all the tag elements.
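from bs4 import BeautifulSoup

html = '''
<td>
<span>some span text</span>
some td text
</td>
'''
soup = BeautifulSoup(html, 'html.parser')
content = soup.find('td')
# Iterating over a tag yields its direct children; keep only the bare
# strings (name is None) and skip those that are empty after stripping.
text = [c.strip() for c in content if c.name is None and c.strip() != '']
print(text)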
This will print:
['some td text']
Here the strip() calls remove the surrounding newlines and discard strings that contain only whitespace.
If you want to join the pieces into a single string afterwards, you can use str.join:
print('\n'.join(text))
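Since the question mentions that not all cells share the same structure, a possible alternative is to ask BeautifulSoup for the direct text nodes only, using find_all with string=True and recursive=False. This is a sketch, not the answer's method: the helper name own_text and the sample table are made up for illustration, and it assumes a bs4 version that accepts the string argument (4.4.0 or later; older versions call it text).

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with cells of differing structure.
html = """
<table><tr>
<td><span>some span text</span> some td text</td>
<td>plain cell</td>
<td><b>only nested text</b></td>
</tr></table>
"""

def own_text(tag):
    """Return the text that sits directly inside `tag`, ignoring nested tags."""
    # string=True matches text nodes; recursive=False restricts the search
    # to direct children, so text inside <span> or <b> is not collected.
    pieces = (s.strip() for s in tag.find_all(string=True, recursive=False))
    return " ".join(p for p in pieces if p)

soup = BeautifulSoup(html, "html.parser")
cell_texts = [own_text(td) for td in soup.find_all("td")]
print(cell_texts)  # ['some td text', 'plain cell', '']
```

A cell whose text lives entirely inside nested tags yields an empty string, which makes such cells easy to detect and handle separately.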
Upvotes: 2