Reputation: 15
I'm trying to scrape text with Beautiful Soup. I need the text inside a span with a specific class, but I want to discard the superscript numbers that sit inside the same span, in a sup tag with a different class. I can easily use get_text to pull the contents of the span, but I end up with the superscript numbers as well. The solution needs to discard every instance of the sup tag along with its text contents.
Example HTML:
<span class="woj">
<sup class="versenum">
16
</sup>
The text I want
</span>
What I get right now: 16 The text I want
What I want: The text I want
Upvotes: 1
Views: 1154
Reputation: 1
You can use this logic: select every sup node and remove it from the document before reading the text.
foreach (var sup in node.SelectNodes("//sup")) {
    sup.Remove();
}
Upvotes: 0
Reputation: 3854
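You can remove every sup tag by calling .extract() on each one. Note that .sup.extract() on its own only removes the first sup tag, so loop over all of them: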
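import bs4 as bs

html = '<span class="woj"><sup class="versenum">16</sup>The text I want</span>'
parsed_element = bs.BeautifulSoup(html, 'html.parser')
# Calling the soup object is shorthand for find_all(); remove each <sup> and its text
for s in parsed_element('sup'):
    s.extract()
text = parsed_element.text  # 'The text I want'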
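If the markup is spread over several lines, as in the question, the leftover newlines also need trimming. Here is a minimal sketch of that case. It assumes the spans of interest carry the class woj, and the helper name text_without_sup is made up for illustration. It uses decompose() instead of extract(), which is a swap on my part: both remove the tag from the tree, but decompose() also destroys it rather than returning it.

```python
from bs4 import BeautifulSoup


def text_without_sup(html, span_class="woj"):
    """Return the text of each matching span with <sup> contents removed.

    Hypothetical helper; the default class name comes from the question.
    """
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for span in soup.find_all("span", class_=span_class):
        # decompose() deletes the tag and everything inside it from the tree
        for sup in span.find_all("sup"):
            sup.decompose()
        # strip=True trims each remaining text piece and drops empty ones
        results.append(span.get_text(" ", strip=True))
    return results


html = """<span class="woj">
<sup class="versenum">
16
</sup>
The text I want
</span>"""

print(text_without_sup(html))  # ['The text I want']
```

The function returns a list because a page will usually contain many such spans. Join the items if you need a single string.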
Upvotes: 1