Kshitiz Gupta

Reputation: 248

How do I parse an HTML string with BeautifulSoup when the text contains inner tags?

I have the following HTML content in a variable and need a way to read its text with the inner tags removed:

html = """<td class="row">India (ASIA) (<a href="/asia/india">india</a>&nbsp;–&nbsp;<a href="/asia/india">photos</a>)</td>"""

I just want to extract the string India (ASIA) out of this with BeautifulSoup. Is that possible, or should I resort to regular expressions?

Upvotes: 4

Views: 9563

Answers (1)

har07

Reputation: 89285

One possible way with BeautifulSoup is to take the text node that immediately precedes the first <a> child element:

from bs4 import BeautifulSoup

html = """<td class="row">India (ASIA) (<a href="/asia/india">india</a>&nbsp;–&nbsp;<a href="/asia/india">photos</a>)</td>"""
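# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")
# previous_sibling is the text node right before the first <a> (previousSibling is the older alias)
result = soup.find("a").previous_sibling
# On Python 3 the node is already a str, so no .decode() call is needed
print(result)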

Output:

India (ASIA) (

Tweaking the code further to remove the trailing ( from the result should be straightforward.
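A minimal sketch of that tweak, assuming Python 3 with the bs4 package installed and that the wanted text always ends in a single dangling "(" as in the sample. The variable names prefix and country are illustrative only:

```python
from bs4 import BeautifulSoup

html = """<td class="row">India (ASIA) (<a href="/asia/india">india</a>&nbsp;–&nbsp;<a href="/asia/india">photos</a>)</td>"""
soup = BeautifulSoup(html, "html.parser")

# Text node immediately before the first <a> child: "India (ASIA) ("
prefix = soup.find("a").previous_sibling

# Strip trailing whitespace, then the dangling "(", then the space left before it
country = str(prefix).rstrip().rstrip("(").rstrip()
print(country)  # India (ASIA)
```

Note that rstrip("(") removes every trailing "(" character, so this would need adjusting if the text itself could legitimately end with an opening parenthesis.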

Upvotes: 4
