Reputation: 283

Nokogiri grab only visible inner_text

Is there a better way to extract the visible text on a web page using Nokogiri? Currently I use the inner_text method, but it counts a lot of JavaScript as visible text. The only text I want to capture is the text that is actually visible on the screen.

For example, in IRB if I do the following in Ruby 1.9.2-p290:

require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX"))
words = doc.inner_text
words.scan(/\w+/)

If I search for the word "function", I see that it appears 20 times in the list. However, if I go to http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX the word "function" does not appear anywhere in the visible text.

Can I ignore JavaScript or is there a better way of doing this?

Upvotes: 3

Answers (4)

yeniv

Reputation: 93

You can remove all script elements from Nokogiri objects.

In your case, you could use:

doc = Nokogiri::HTML(open("http://www.bodybuilding.com"))
# remove detaches the nodes in place and returns them, so don't assign the result back to doc
doc.xpath("//script").remove

Upvotes: 0

user137369

Reputation: 5706

Ignore the tags where JavaScript lives (<script>). While we’re at it, we should also ignore CSS (<style>).

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(URI.open("http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX"))
doc.css('style').each(&:remove)
doc.css('script').each(&:remove)

puts doc.text

# Alternatively, for cleaner output:
# puts doc.text.split("\n").map(&:strip).reject(&:empty?)

Upvotes: 1

Justin Ko

Reputation: 46836

You could try:

require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open("http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX"))

doc.traverse{ |x|
    if x.text? && x.text !~ /^\s*$/
        puts x.text
    end
}

I have not done much with Nokogiri, but I believe this should find and output all text nodes in the document that are not blank. At least in my test it ignored the JavaScript, and all the text I checked was visible on the page (though some of it was in the dropdown menus).

Upvotes: 7

the Tin Man

Reputation: 160551

You can ignore JavaScript and there is a better way. You're ignoring the power of Nokogiri. Badly.

Rather than provide you with the direct answer, it will do you well to learn to "fish" using Nokogiri.

In a document like:

<html>
  <body>
    <p>foo</p>
    <p>bar</p>
  </body>
</html>

I recommend starting with CSS accessors because they're generally more familiar to people:

doc = Nokogiri::HTML(var_containing_html) will parse and return the HTML DOM in doc.
doc.at('p') will return a Node, which basically points to the first <p> node.
doc.search('p') will return a NodeSet of all matching nodes, which acts like an array, in this case all <p> nodes.
doc.at('p').text will return the text inside a node.
doc.search('p').map{ |n| n.text } will return all the text in the <p> nodes as an array of text strings.

As your document gets more complex you need to drill down. Sometimes you can do it using a CSS accessor, such as 'body p' or something similar, and sometimes you need to use XPaths. I won't go into those but there are great tutorials and references out there.

Nokogiri's tutorials are very good. Go through them and they will reveal all you need to know.

In addition, there are many answers on Stack Overflow discussing this sort of problem. Check out the "Related" links on the right of the page.

Upvotes: 2
