J-fish

Reputation: 228

How to get Nokogiri inner_HTML object to ignore/remove escape sequences

I am trying to get the inner HTML of an element on a page using Nokogiri. However, I'm not getting just the text of the element; I'm also getting the whitespace escape sequences around it (\r, \n, \t). Is there a way I can suppress or remove them with Nokogiri?

require 'nokogiri'
require 'open-uri'

# URI.open is the open-uri entry point; Kernel#open no longer opens URLs in Ruby 3.0+
page = Nokogiri::HTML(URI.open("http://the.page.url.com"))

page.at_css("td[custom-attribute='foo']").parent.css('td').css('a').inner_html

This returns: "\r\n\t\t\t\t\t\t\t\tTheActuallyInnerContentThatIWant\r\n\t"

What is the most effective and direct Nokogiri (or Ruby) way of doing this?

Upvotes: 2

Views: 2071

Answers (2)

Aleksei Matiushkin

Reputation: 121000

page.at_css("td[custom-attribute='foo']")
    .parent
    .css('td')
    .css('a')
    .text               # you want the text, not the inner HTML
    .strip              # remove leading and trailing whitespace

See String#strip, which removes leading and trailing whitespace, including \r, \n and \t.
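As a quick check that needs no Nokogiri at all, here is a minimal sketch applying strip to the exact string from the question. The variable names are only illustrative.

```ruby
# The string the question reports getting back from inner_html.
raw = "\r\n\t\t\t\t\t\t\t\tTheActuallyInnerContentThatIWant\r\n\t"

# strip removes whitespace from both ends only.
cleaned = raw.strip
puts cleaned.inspect

# Whitespace in the middle of a string is left alone.
inner = "  foo\r\n bar  ".strip
puts inner.inspect
```

Because strip only trims the ends, any whitespace between words inside the element survives, which is usually what you want for text content.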

Sidenote: css('td a') is likely more efficient than css('td').css('a').

Upvotes: 3

the Tin Man

Reputation: 160551

It's important to drill down to the node closest to the text you want. Consider this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <p>foo</p>
  </body>
</html>
EOT

doc.at('body').inner_html # => "\n    <p>foo</p>\n  "
doc.at('body').text # => "\n    foo\n  "
doc.at('p').inner_html # => "foo"
doc.at('p').text # => "foo"

at, at_css and at_xpath return a single Node (an XML::Element here). search, css and xpath return a NodeSet. text and inner_html behave very differently depending on whether they are called on a Node or a NodeSet:

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <p>foo</p>
    <p>bar</p>
  </body>
</html>
EOT

doc.at('p') # => #<Nokogiri::XML::Element:0x3fd635cf36f4 name="p" children=[#<Nokogiri::XML::Text:0x3fd635cf3514 "foo">]>
doc.search('p') # => [#<Nokogiri::XML::Element:0x3fd635cf36f4 name="p" children=[#<Nokogiri::XML::Text:0x3fd635cf3514 "foo">]>, #<Nokogiri::XML::Element:0x3fd635cf32bc name="p" children=[#<Nokogiri::XML::Text:0x3fd635cf30dc "bar">]>]

doc.at('p').class # => Nokogiri::XML::Element
doc.search('p').class # => Nokogiri::XML::NodeSet

doc.at('p').text # => "foo"
doc.search('p').text # => "foobar"

Notice that search returned a NodeSet, and that text returned the text of all its nodes concatenated together. This is rarely what you want.

Also notice that Nokogiri can almost always work out whether a selector is CSS or XPath, so the generic search and at methods are convenient for either type of selector.

Upvotes: 1
