Reputation: 133
How can I parse non-standard HTML like this:
<body>
<div class="open">
<div style='style'>Raw name 1</div>
<p>Text_1</p>
<p>Text_2</p>
<p>Text_3</p>
<p>Text_4</p>
<p>Text_5</p>
<div style='style'>Raw name 5</div>
<p>Text_1</p>
<p>Text_2</p>
<p>Text_3</p>
<p>Text_4</p>
<p>Text_5</p>
</div>
</body>
I want to get a result similar to:
['Raw name 1', Text_1, Text_2, Text_3, Text_4, Text_5]
...
['Raw name 5', Text_1, Text_2, Text_3, Text_4, Text_5]
I tried to adapt the example from "How to parse a HTML table with Nokogiri?", but got no results.
Is it possible to extract the information I want from HTML like this?
Upvotes: 2
Views: 171
Reputation: 160551
I'd do something like:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<body>
<div class="open">
<div style='style'>Raw name 1</div>
<p>Text_1</p>
<p>Text_2</p>
<div style='style'>Raw name 5</div>
<p>Text_1</p>
<p>Text_2</p>
</div>
</body>
EOT
doc.at('.open').elements.slice_before { |e| e.name == 'div' }.map { |ary|
  ary.map(&:text)
}
# => [["Raw name 1", "Text_1", "Text_2"], ["Raw name 5", "Text_1", "Text_2"]]
Breaking it down a bit:
doc.at('.open').elements.map(&:name) # => ["div", "p", "p", "div", "p", "p"]
doc.at('.open').elements.slice_before { |e| e.name == 'div' }.map { |a| a.map(&:name) } # => [["div", "p", "p"], ["div", "p", "p"]]
elements and slice_before are the magic here.
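To see what slice_before does independently of Nokogiri, here is a minimal sketch on a plain array. The [name, text] pairs are hypothetical stand-ins for the child elements, not real Nokogiri nodes:

```ruby
# Hypothetical stand-ins for the child elements of div.open:
# each pair is [tag name, text].
nodes = [
  ['div', 'Raw name 1'], ['p', 'Text_1'], ['p', 'Text_2'],
  ['div', 'Raw name 5'], ['p', 'Text_1'], ['p', 'Text_2']
]

# slice_before starts a new chunk in front of every element for which
# the block returns true, here every 'div'.
groups = nodes.slice_before { |name, _text| name == 'div' }
              .map { |chunk| chunk.map { |_name, text| text } }

p groups
# => [["Raw name 1", "Text_1", "Text_2"], ["Raw name 5", "Text_1", "Text_2"]]
```

Note that any elements appearing before the first div would end up in their own leading chunk, so real input may need a check for that case.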
Upvotes: 3
Reputation: 29328
If I understand correctly, this might work for you:
require 'nokogiri'
body = <<-BODY
<body>
<div class="open">
<div style='style'>Raw name 1</div>
<p>Text_1</p>
<p>Text_2</p>
<p>Text_3</p>
<p>Text_4</p>
<p>Text_5</p>
<div style='style'>Raw name 5</div>
<p>Text_1</p>
<p>Text_2</p>
<p>Text_3</p>
<p>Text_4</p>
<p>Text_5</p>
</div>
</body>
BODY
doc = Nokogiri::HTML(body)
doc.xpath('//body/div').children.each_with_object({}) do |node,obj|
  text = node.text.strip
  obj[text] = [] if node.name == 'div'
  obj[obj.keys.last] << text if node.name == 'p'
end
#=> {"Raw name 1"=>["Text_1", "Text_2", "Text_3", "Text_4", "Text_5"],
# "Raw name 5"=>["Text_1", "Text_2", "Text_3", "Text_4", "Text_5"]}
Steps:
1. Use xpath to select the div directly under body (doc.xpath('//body/div')).
2. Pass the child nodes (.children) of that div to the block along with an object (.each_with_object({}) do |node,obj|), in this case a Hash used as an accumulator.
3. For each div tag, use its text as a key and assign it an empty array (obj[text] = [] if node.name == 'div').
4. For each p tag, append its text to the array under the most recent key (obj[obj.keys.last] << text if node.name == 'p').
The result is a Hash where the keys are the div texts and each value is an Array of the text of the p tags that follow, up to the next div.
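If the array-of-arrays shape from the question is preferred over a Hash, the result can be converted afterwards. A small sketch, assuming result (a hypothetical variable name) holds the Hash produced by the each_with_object call above:

```ruby
# Hypothetical variable holding the Hash built by each_with_object above.
result = {
  'Raw name 1' => %w[Text_1 Text_2 Text_3 Text_4 Text_5],
  'Raw name 5' => %w[Text_1 Text_2 Text_3 Text_4 Text_5]
}

# Prepend each key to its array of paragraph texts.
rows = result.map { |name, texts| [name, *texts] }

rows.each { |row| p row }
# ["Raw name 1", "Text_1", "Text_2", "Text_3", "Text_4", "Text_5"]
# ["Raw name 5", "Text_1", "Text_2", "Text_3", "Text_4", "Text_5"]
```

One caveat with the Hash approach: because the div text is the key, two divs with identical text would collapse into one entry, with the later one resetting the array. If that can happen in the real data, grouping into arrays directly is probably safer.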
Upvotes: 3