TimD

Reputation: 8572

How to use Nokogiri and XPath to get nodes with multiple attributes

I'm trying to use Nokogiri to parse an HTML file with some fairly eccentric markup. Specifically, I'm trying to grab divs which have both ids, multiple classes and styles defined.

The markup looks something like this:

<div id="foo">
  <div id="bar" class="baz bang" style="display: block;">
    <h2>title</h2>
    <dl>
      List of stuff
    </dl>
  </div>
</div>

I'm attempting to grab the <dl> which sits inside the problem <div>. I can get divs with a single id attribute with no problem, but I can't figure out a way of getting Nokogiri to grab divs with both ids and classes.

So these work fine:

content = @doc.xpath("//div[id='foo']")
content = @doc.css('div#foo')

But these don't return anything:

content = @doc.xpath("//div[id='bar']")
content = @doc.xpath("div#bar")

Is there something obvious that I'm missing here?

Upvotes: 7

Views: 9264

Answers (5)

the Tin Man

Reputation: 160621

I strongly recommend starting with CSS selectors rather than XPath, because CSS selectors are more readable and produce less visual noise.

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<div id="foo">
  <div id="bar" class="baz bang" style="display: block;">
    <h2>title</h2>
    <dl>
      List of stuff
    </dl>
  </div>
</div>
EOT

Once that's parsed, use CSS to look for <div id="foo">:

doc.at('div#foo').to_html 
# => "<div id=\"foo\">\n" +
#    "  <div id=\"bar\" class=\"baz bang\" style=\"display: block;\">\n" +
#    "    <h2>title</h2>\n" +
#    "    <dl>\n" +
#    "      List of stuff\n" +
#    "    </dl>\n" +
#    "  </div>\n" +
#    "</div>"

And <div id="bar">:

doc.at('div#bar').to_html 
# => "<div id=\"bar\" class=\"baz bang\" style=\"display: block;\">\n" +
#    "    <h2>title</h2>\n" +
#    "    <dl>\n" +
#    "      List of stuff\n" +
#    "    </dl>\n" +
#    "  </div>"

We can search for tags with both class names:


doc.at('.baz.bang').to_html
# => "<div id=\"bar\" class=\"baz bang\" style=\"display: block;\">\n" +
#    "    <h2>title</h2>\n" +
#    "    <dl>\n" +
#    "      List of stuff\n" +
#    "    </dl>\n" +
#    "  </div>"

And we can look for the explicit div with both classes and its embedded <dl> tag:

doc.at('div.baz.bang dl').to_html
# => "<dl>\n" +
#    "      List of stuff\n" +
#    "    </dl>"

Or even by ID and classes:

doc.at('div#bar.baz.bang').to_html
# => "<div id=\"bar\" class=\"baz bang\" style=\"display: block;\">\n" +
#    "    <h2>title</h2>\n" +
#    "    <dl>\n" +
#    "      List of stuff\n" +
#    "    </dl>\n" +
#    "  </div>"

And with the <dl>:

doc.at('div#bar.baz.bang dl').to_html
# => "<dl>\n" +
#    "      List of stuff\n" +
#    "    </dl>"

I'm using at, which is equivalent to search(...some selector...).first. Nokogiri provides search, along with its CSS-only and XPath-only variants css and xpath, all of which return a NodeSet. It also provides at, at_css and at_xpath, which return a single Node. It's important to understand the difference between a NodeSet and a Node, how each relates to text, and what the Searchable module offers, so read the documentation.

Upvotes: 0

user357812

Reputation:

You wrote:

I'm trying to grab divs which have both ids, multiple classes and styles defined

And

I'm attempting to grab the <dl> which sits inside the problem div

So, use this XPath 1.0 expression:

//div[@id][contains(normalize-space(@class),' ')][@style]/dl

Upvotes: 1

AboutRuby

Reputation: 8116

The following works for me.

require 'rubygems'
require 'nokogiri'

html = %{
<div id="foo">
  <div id="bar" class="baz bang" style="display: block;">
    <h2>title</h2>
    <dl>
      List of stuff
    </dl>
  </div>
</div>
}

doc = Nokogiri::HTML.parse(html)
content = doc
  .xpath("//div[@id='foo']/div[@id='bar' and @class='baz bang']/dl")
  .inner_html

puts content

Upvotes: 3

Dimitre Novatchev

Reputation: 243599

I can get divs with a single id attribute with no problem, but I can't figure out a way of getting Nokogiri to grab divs with both ids and classes.

You want:

//div[@id='bar' and @class='baz bang' and @style='display: block;']

Upvotes: 4

Daniel O'Hara

Reputation: 13438

I think content = @doc.xpath("div#bar") is a typo and should be content = @doc.css("div#bar") or, better, content = @doc.css("#bar"). The first expression in your second code chunk is nearly right, but XPath needs @ before an attribute name: //div[@id='bar'].

Upvotes: 1
