Reputation: 8572
I'm trying to use Nokogiri to parse an HTML file with some fairly eccentric markup. Specifically, I'm trying to grab divs which have both ids, multiple classes and styles defined.
The markup looks something like this:
<div id="foo">
<div id="bar" class="baz bang" style="display: block;">
<h2>title</h2>
<dl>
List of stuff
</dl>
</div>
</div>
I'm attempting to grab the <dl> which sits inside the problem <div>. I can get divs with a single id attribute with no problem, but I can't figure out a way of getting Nokogiri to grab divs with both ids and classes.
So these work fine:
content = @doc.xpath("//div[@id='foo']")
content = @doc.css('div#foo')
But these don't return anything:
content = @doc.xpath("//div[@id='bar']")
content = @doc.xpath("div#bar")
Is there something obvious that I'm missing here?
Upvotes: 7
Views: 9264
Reputation: 160621
I strongly recommend starting with CSS selectors rather than XPath, as CSS selectors are more readable and produce less visual noise.
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div id="foo">
<div id="bar" class="baz bang" style="display: block;">
<h2>title</h2>
<dl>
List of stuff
</dl>
</div>
</div>
EOT
Once that's parsed, we can use CSS to look for <div ... id="foo">:
doc.at('div#foo').to_html
# => "<div id=\"foo\">\n" +
# " <div id=\"bar\" class=\"baz bang\" style=\"display: block;\">\n" +
# " <h2>title</h2>\n" +
# " <dl>\n" +
# " List of stuff\n" +
# " </dl>\n" +
# " </div>\n" +
# "</div>"
And <div id="bar">:
doc.at('div#bar').to_html
# => "<div id=\"bar\" class=\"baz bang\" style=\"display: block;\">\n" +
# " <h2>title</h2>\n" +
# " <dl>\n" +
# " List of stuff\n" +
# " </dl>\n" +
# " </div>"
We can search for tags with both class names:
doc.at('.baz.bang').to_html
# => "<div id=\"bar\" class=\"baz bang\" style=\"display: block;\">\n" +
# " <h2>title</h2>\n" +
# " <dl>\n" +
# " List of stuff\n" +
# " </dl>\n" +
# " </div>"
And we can look for the explicit div with both classes and its embedded <dl> tag:
doc.at('div.baz.bang dl').to_html
# => "<dl>\n" +
# " List of stuff\n" +
# " </dl>"
Or even by ID and classes:
doc.at('div#bar.baz.bang').to_html
# => "<div id=\"bar\" class=\"baz bang\" style=\"display: block;\">\n" +
# " <h2>title</h2>\n" +
# " <dl>\n" +
# " List of stuff\n" +
# " </dl>\n" +
# " </div>"
And with the <dl>:
doc.at('div#bar.baz.bang dl').to_html
# => "<dl>\n" +
# " List of stuff\n" +
# " </dl>"
I'm using at, which is equivalent to search(...some selector...).first. Nokogiri supports search, along with its CSS- and XPath-specific variations css and xpath, all of which return a NodeSet; and at, at_css and at_xpath, which return a single Node. It's important to understand the difference between a NodeSet and a Node, how text behaves on each, and what Searchable provides, so read the documentation.
Upvotes: 0
Reputation:
You wrote:
I'm trying to grab divs which have both ids, multiple classes and styles defined
And
I'm attempting to grab the <dl> which sits inside the problem div
So, this XPath 1.0:
//div[@id][contains(normalize-space(@class),' ')][@style]/dl
Upvotes: 1
Reputation: 8116
The following works for me.
require 'rubygems'
require 'nokogiri'
html = %{
<div id="foo">
<div id="bar" class="baz bang" style="display: block;">
<h2>title</h2>
<dl>
List of stuff
</dl>
</div>
</div>
}
doc = Nokogiri::HTML.parse(html)
content = doc
  .xpath("//div[@id='foo']/div[@id='bar' and @class='baz bang']/dl")
  .inner_html
puts content
Upvotes: 3
Reputation: 243599
I can get divs with a single id attribute with no problem, but I can't figure out a way of getting Nokogiri to grab divs with both ids and classes.
You want:
//div[@id='bar' and @class='baz bang' and @style='display: block;']
Upvotes: 4
Reputation: 13438
I think content = @doc.xpath("div#bar") is a typo and should be content = @doc.css("div#bar") or, better, content = @doc.css("#bar"). The first expression in your second code chunk seems to be ok.
Upvotes: 1