Reputation: 741

Nokogiri, How to remove unnecessary html content?

I have an issue with Nokogiri. Let's say I have this HTML:

<html> 
<p>
        This is just an example, how to remove the next sentence using nokogiri in Ruby.
        Thank you for your help.
        <strong> XXXX </strong>
            <br/> 
            <br />
        I want to remove all the HTML after the strong XXXX
            <br />
            <br />
        <strong> YYY </strong>
</p>

How can I get just "This is just an example, how to remove the next sentence using nokogiri ... Thank you for your help."? I don't want to include anything from <strong> XXXX through the end of the paragraph.

Upvotes: 1

Answers (3)

Matt McNaughton

Reputation: 216

If you only want the text (which is what I think you are asking), you can call the text method on the Nokogiri element. That returns "... Thank you for your help. XXXX I want to remove all the HTML after the strong XXXX YYY". The Nokogiri documentation covers the text method, if that's helpful. Or do you want to leave out all of the text and HTML after the <strong> tag?

Upvotes: 0

Mark Thomas

Reputation: 37517

To exclude the text that follows the strong specifically, you may want to try:

doc.search('//p/text()[not(preceding-sibling::strong)]').text

This says: get all text nodes directly under p that do not have a strong as a preceding sibling, that is, the text that comes before the first strong.

Given your input, this extracts the following:

        This is just an example, how to remove the next sentence using nokogiri in Ruby.
        Thank you for your help.

Upvotes: 2

Arup Rakshit

Reputation: 118271

I hope you were looking for something like this:

require 'nokogiri'

doc = Nokogiri::HTML::Document.parse <<-_HTML_
<p>
        This is just an example, how to remove the next sentence using nokogiri in Ruby.
        Thank you for your help.
        <strong> XXXX </strong>
            <br/> 
            <br />
        I want to remove all the HTML after the strong XXXX
            <br />
            <br />
        <strong> YYY </strong>
</p>
_HTML_

puts doc.at('//p/text()[1]').to_s.strip
# >> This is just an example, how to remove the next sentence using nokogiri in Ruby.
# >>         Thank you for your help.

If you instead want to remove the unwanted HTML content from the source document itself, you can try something like this:

require 'nokogiri'

doc = Nokogiri::HTML::Document.parse <<-_HTML_
<p>
        This is just an example, how to remove the next sentence using nokogiri in Ruby.
        Thank you for your help.
        <strong> XXXX </strong>
            <br/> 
            <br />
        I want to remove all the HTML after the strong XXXX
            <br />
            <br />
        <strong> YYY </strong>
</p>
_HTML_


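# Select the element and text children of <p>, in document order.
doc.xpath('//p/* | //p/text()').count # => 10
# Keep the first node (the leading text) and remove everything after it.
ndst = doc.search('//p/* | //p/text()')[1..-1]
ndst.remove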


puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body><p>
# >>         This is just an example, how to remove the next sentence using nokogiri in Ruby.
# >>         Thank you for your help.
# >>         </p></body></html>

Upvotes: 0
