antc

Reputation: 15

Removing text in a <sup> tag from a span while scraping the rest of the text

I'm scraping text with Beautiful Soup. I need the text inside a span with a specific class, but I want to discard the superscript numbers that sit inside the same span in a sup tag with a different class. I can easily use get_text to pull the contents of the span, but the result includes the superscript numbers as well. The solution needs to discard every instance of the sup tag along with its text contents.

Example HTML:

<span class="woj">
 <sup class="versenum">
  16
 </sup>
  The text I want
</span>

What I get right now: 16 The text I want

What I want: The text I want

Upvotes: 1

Views: 1154

Answers (2)

Sam

Reputation: 1

You can use this logic: remove every sup node first, then read the remaining text.

foreach(var sup in node.SelectNodes("//sup")) {
   sup.Remove();
}

Upvotes: 0

Michael Dz

Reputation: 3854

You can remove every sup tag by calling .extract() on each one. Once they are detached from the tree, .text returns only the remaining text.

import bs4 as bs

html = '<span class="woj"><sup class="versenum">16</sup>The text I want</span>'

parsed_element = bs.BeautifulSoup(html, 'html.parser')
# Detach every <sup> (and its text) from the tree.
for s in parsed_element('sup'):
    s.extract()
text = parsed_element.text  # 'The text I want'
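
Applied to the HTML from the question, which has whitespace around the verse number and the text, the same idea might look like the sketch below. The helper name text_without_sup is hypothetical, and it assumes every span of interest has the class woj. It uses decompose() rather than extract(), since the removed tags aren't needed afterwards, and get_text(strip=True) to trim the leftover whitespace.

```python
from bs4 import BeautifulSoup

# The example markup from the question, whitespace included.
HTML = """
<span class="woj">
 <sup class="versenum">
  16
 </sup>
  The text I want
</span>
"""


def text_without_sup(html):
    """Return the text of each span.woj with its <sup> descendants removed.

    Hypothetical helper for illustration; the class name comes from the question.
    """
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for span in soup.find_all("span", class_="woj"):
        # decompose() removes the tag and its contents from the tree.
        for sup in span.find_all("sup"):
            sup.decompose()
        # strip=True trims each remaining string and drops the empty ones.
        results.append(span.get_text(strip=True))
    return results


print(text_without_sup(HTML))
```

If the spans contain other inline tags, passing a separator such as get_text(" ", strip=True) keeps the pieces from running together.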

Upvotes: 1
