Reputation: 248
I have the following HTML content in a variable and need a way to read the text from it while dropping the inner tags:
html=<td class="row">India (ASIA) (<a href="/asia/india">india</a> – <a href="/asia/india">photos</a>)</td>
I just want to extract the string India (ASIA) out of this with BeautifulSoup. Is that possible, or should I resort to regular expressions for this?
Upvotes: 4
Views: 9563
Reputation: 89285
This is one possible way using BeautifulSoup: take the text node that comes immediately before the first child <a> element:
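from bs4 import BeautifulSoup
html = """<td class="row">India (ASIA) (<a href="/asia/india">india</a> – <a href="/asia/india">photos</a>)</td>"""
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")
# previous_sibling is the text node just before the first <a>
result = soup.find("a").previous_sibling
# On Python 3 the node is already a str, so no decode() call is needed
print(result)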
Output:
India (ASIA) (
Tweaking the code further to remove the trailing ( from result should be straightforward.
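For what it's worth, that last tweak could look something like the sketch below. It only covers the string clean-up step, so it uses the text printed above as a plain literal instead of parsing the HTML again. The helper name strip_trailing_paren is made up for illustration and is not part of BeautifulSoup.

```python
def strip_trailing_paren(text: str) -> str:
    """Remove trailing spaces and opening parentheses from the end of text.

    Hypothetical helper for illustration only, not a BeautifulSoup API.
    """
    # rstrip(" (") removes any run of ' ' and '(' characters at the right end
    # and stops at the first other character, here the ')' that closes "(ASIA)".
    return text.rstrip(" (")


# The text node extracted before the first <a>, as shown in the output above
raw = "India (ASIA) ("
country = strip_trailing_paren(raw)
print(country)
```

With the snippet from the answer, you would call it as strip_trailing_paren(str(result)). Note that this assumes the wanted text never itself ends in an opening parenthesis.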
Upvotes: 4