How to find a node given the exact HTML tag as a string (using Nokogiri)?

Question

I need to search a given web page for a particular node when given the exact HTML as a string. For instance, if given:

url = "https://www.wikipedia.org/"
node_to_find = "<title>Wikipedia</title>"

I want to "select" that node on the page (and eventually return its children and sibling nodes). I'm having trouble with the Nokogiri docs and with how exactly to go about this. Most of the time, it seems, people use XPath syntax or the #css method to find nodes that satisfy a set of conditions. I want to start from the HTML itself and find the exact match within a web page.

Possible start of a solution?

If I create two Nokogiri::HTML::DocumentFragment objects from the same string, they look identical but do not compare as equal, apparently because each one is a distinct object in memory. I think this might be a precursor to solving it?

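irb(main):018:0> n = Nokogiri::HTML::DocumentFragment.parse("<title>Wikipedia</title>").child

=> #<Nokogiri::XML::Element:0x... name="title" children=[#<Nokogiri::XML::Text:0x... "Wikipedia">]>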

irb(main):019:0> n.class

=> Nokogiri::XML::Element

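Then I create a second one using the exact same arguments. Comparing them returns false:

irb(main):020:0> x = Nokogiri::HTML::DocumentFragment.parse("<title>Wikipedia</title>").child

=> #<Nokogiri::XML::Element:0x... name="title" children=[#<Nokogiri::XML::Text:0x... "Wikipedia">]>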

irb(main):021:0> n == x

=> false

So I'm thinking that if I can somehow create a method that finds matches like this, then I can perform operations on that node. In particular, I want to find its children and its next sibling.

EDIT: I should mention that my code has a method that creates a Nokogiri::HTML::Document object from a given URL, so that document will be available to compare against.

require 'nokogiri'
require 'open-uri'

class Page
  attr_accessor :url, :node, :doc, :root

  def initialize(params = {})
    @url = params.fetch(:url, "").to_s
    @node = params.fetch(:node, "").to_s
    @doc = parse_html(@url)
  end

  # Fetch the page and parse it into a Nokogiri::HTML::Document.
  def parse_html(url)
    Nokogiri::HTML(URI.open(url).read)
  end
end

maerics · Accepted Answer

As suggested by commenter @August, you could use Node#traverse to see if the string representation of any node matches the string form of your target node.

def find_node(html_document, html_fragment)
  matching_node = nil
  html_document.traverse do |node|
    matching_node = node if node.to_s == html_fragment.to_s
  end
  matching_node
end

Of course, this approach is fraught with problems that boil down to the canonical representation of the data (do you care about attribute ordering? specific syntax items like quotation marks? whitespace?).

[Edit] Here's a prototype of converting an arbitrary HTML element to an XPath expression. It needs some work but the basic idea (match any element with the node name, specific attributes, and possibly text child) should be a good starting place.

def html_to_xpath(html_string)
  node = Nokogiri::HTML::fragment(html_string).children.first
  has_more_than_one_child = (node.children.size > 1)
  has_non_text_child = node.children.any? { |x| x.type != Nokogiri::XML::Node::TEXT_NODE }
  if has_more_than_one_child || has_non_text_child
    raise ArgumentError.new('element may only have a single text child')
  end
  xpath = "//#{node.name}"
  node.attributes.each do |_, attr|
    xpath += "[@#{attr.name}='#{attr.value}']" # TODO: escaping.
  end
  xpath += "[text()='#{node.children.first.to_s}']" unless node.children.empty?
  xpath
end
html_to_xpath('<title>Wikipedia</title>') # => "//title[text()='Wikipedia']"
html_to_xpath('<div id="foo">Foo</div>')  # => "//div[@id='foo'][text()='Foo']"
html_to_xpath('<div><br/></div>') # => ArgumentError: element may only have a single text child

It seems possible that you could build an XPath from any HTML fragment (e.g. not restricted to those with only a single text child, per my prototype above) but I'll leave that as an exercise for the reader ;-)
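Any generalization would also need to handle the escaping left as a TODO in the prototype. XPath 1.0 string literals have no escape character, so a value containing both kinds of quotes has to be assembled with `concat()`. The helper below is a sketch under that assumption; `xpath_literal` is a hypothetical name, not a Nokogiri method.

```ruby
# Hypothetical helper: render an arbitrary Ruby string as an XPath 1.0
# string literal, choosing whichever quoting style is safe for the value.
def xpath_literal(value)
  # No single quotes: wrap in single quotes.
  return "'#{value}'" unless value.include?("'")
  # Single quotes but no double quotes: wrap in double quotes.
  return "\"#{value}\"" unless value.include?('"')

  # Both kinds of quotes: split on single quotes and rejoin the pieces
  # with a double-quoted single quote inside concat().
  parts = value.split("'", -1).map { |part| "'#{part}'" }
  "concat(#{parts.join(%q{, "'", })})"
end
```

With this in place, an attribute predicate could be built as `"[@#{attr.name}=#{xpath_literal(attr.value)}]"` instead of interpolating the raw value between quotes.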
