Reputation: 151
I'm currently working on a small web scraping project with Ruby and XPath. Unfortunately, the website is very poorly structured, which leads me to a little problem:
<h3>Relevant Headline</h3>
<p class="class_a class_b">Content starts in this paragraph...</p>
<p class="class_a ">...but this content belongs to the preceding paragraph</p>
<p class="class_a class_b">Content starts in this paragraph...</p>
<p class="class_a ">...but this content belongs to the preceding paragraph</p>
<h3>Some other Headline</h3>
As you can see, two h3 tags frame several p tags. I want to select all the framed p tags. I already found the following XPath to do that:
h3[contains(text(),"Relevant")]/following-sibling::p[1 = count(preceding-sibling::h3[1] | ../h3[contains(text(),"Relevant")])]
But now comes the difficulty: some of the paragraphs above belong together. The paragraph with class_b (the first one) begins a new data entry, and the next one (the second) belongs to that entry. The same goes for paragraphs 3 and 4. The problem is that sometimes 3 paragraphs belong together, sometimes 4, although most of the time they come in pairs.
How do I select these inner paragraphs by group and combine each group into one string in Ruby?
Upvotes: 2
Views: 368
Reputation: 55002
It can be done with XPath, but I think it's easier to group them with slice_before:
doc.search('*').slice_before{|n| n.name == 'h3'}.each do |h3_group|
  h3_group.slice_before{|n| n[:class] && n[:class]['class_b']}.to_a[1..-1].each do |p_group|
    puts p_group.map(&:text) * ' '
  end
end
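To see what the two nested slice_before calls do without a parsed document, here is a minimal sketch in plain Ruby. The FakeNode struct and the sample texts are hypothetical stand-ins for Nokogiri nodes, not part of any library; only Enumerable#slice_before is the real call being illustrated.

```ruby
# Hypothetical stand-in for a parsed node: tag name, class attribute, text.
FakeNode = Struct.new(:name, :klass, :text)

nodes = [
  FakeNode.new('h3', nil, 'Relevant Headline'),
  FakeNode.new('p', 'class_a class_b', 'Content starts in this paragraph...'),
  FakeNode.new('p', 'class_a ', '...but this content belongs to the preceding paragraph'),
  FakeNode.new('p', 'class_a class_b', 'Second entry starts...'),
  FakeNode.new('p', 'class_a ', '...continues...'),
  FakeNode.new('p', 'class_a ', '...and ends.'),
  FakeNode.new('h3', nil, 'Some other Headline')
]

# Start a new slice at every h3, then a new slice at every class_b paragraph.
entries = nodes.slice_before { |n| n.name == 'h3' }.flat_map do |section|
  section
    .slice_before { |n| n.klass.to_s.include?('class_b') }
    .drop(1) # the first slice holds the h3 itself, not an entry
    .map { |group| group.map(&:text).join(' ') }
end

puts entries
```

Each string in entries is one data entry, no matter whether it spans two, three, or four paragraphs.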
UPDATE
Another option using css:
doc.search('p.class_b').each do |p|
  str, next_node = p.text, p
  while next_node = next_node.at('+ p:not([class*=class_b])')
    str += " #{next_node.text}"
  end
  puts str
end
Upvotes: 3
Reputation: 46846
If you do not mind using a combination of XPath and Nokogiri, you can do:
paragraph_text = Array.new
doc.xpath('//p[preceding-sibling::h3[1][contains(text(), "Relevant")]]').each do |p|
  if p.attribute('class').text.include?('class_b')
    # A class_b paragraph starts a new entry
    paragraph_text << p.content
  else
    # Any other paragraph is appended to the current entry
    paragraph_text[-1] += p.text
  end
end
puts paragraph_text.inspect
#=> ["Content starts in this paragraph......but this content belongs to the preceding paragraph", "Content starts in this paragraph......but this content belongs to the preceding paragraph"]
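Basically, the XPath is used to get the paragraph tags that follow the relevant h3. Then plain Ruby iterates through those Nokogiri nodes and builds the strings: a class_b paragraph opens a new string, and every other paragraph is appended to the most recent one.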
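The accumulation step can also be tried on its own. The following is a minimal sketch under stated assumptions: Para and group_paragraphs are hypothetical names, and the paragraphs are plain structs rather than Nokogiri nodes. Unlike the loop above, it skips a continuation paragraph that appears before any class_b paragraph instead of raising on an empty array.

```ruby
# Hypothetical stand-in for a <p> node: its class attribute and its text.
Para = Struct.new(:klass, :text)

# Fold paragraphs into entries: class_b opens an entry,
# any other paragraph extends the most recent entry.
def group_paragraphs(paragraphs)
  paragraphs.each_with_object([]) do |para, entries|
    if para.klass.to_s.split.include?('class_b')
      entries << para.text.dup
    elsif entries.any?
      entries.last << para.text
    end
    # A continuation paragraph with no open entry is skipped.
  end
end

paragraphs = [
  Para.new('class_a class_b', 'Content starts in this paragraph...'),
  Para.new('class_a ', '...but this content belongs to the preceding paragraph'),
  Para.new('class_a class_b', 'Another entry...'),
  Para.new('class_a ', '...second part...'),
  Para.new('class_a ', '...third part')
]

puts group_paragraphs(paragraphs).inspect
```

With real nodes, the same fold should work by mapping each node to its class attribute and text first, assuming the XPath has already narrowed the set to the framed paragraphs.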
Upvotes: 4