zinky

Reputation: 151

Selecting Paragraphs in Groups with xPath in Ruby

I'm currently working on a small web scraping project with Ruby and XPath. Unfortunately, the website is very badly structured, which leads me to a little problem:

<h3>Relevant Headline</h3>
<p class="class_a class_b">Content starts in this paragraph...</p>
<p class="class_a ">...but this content belongs to the preceding paragraph</p>
<p class="class_a class_b">Content starts in this paragraph...</p>
<p class="class_a ">...but this content belongs to the preceding paragraph</p>
<h3>Some other Headline</h3>

As you can see, there are two h3 tags which frame several p tags. I want to select all of the framed p tags. I have already found the following XPath to do that:

h3[contains(text(),"Relevant")]/following-sibling::p[1 = count(preceding-sibling::h3[1] | ../h3[contains(text(),"Relevant")])]

But now comes the difficulty: the paragraphs above belong together in groups. The paragraph with class_b (the first one) begins a new data entry, and the next one (the second) belongs to that entry. The same holds for the third and fourth paragraphs. The problem is that the group size varies: sometimes 3 paragraphs belong together, sometimes 4, but most of the time it is a pair.

How do I select these inner paragraphs in groups and combine each group into one string in Ruby?

Upvotes: 2

Views: 368

Answers (2)

pguardiario

Reputation: 55002

It can be done with XPath, but I think it's easier to group them with slice_before:

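# Split all elements into groups that each start at an h3
doc.search('*').slice_before{|n| n.name == 'h3'}.each do |h3_group|
  # Split each group at every class_b paragraph; [1..-1] drops the leading slice (the h3 itself)
  h3_group.slice_before{|n| n[:class] && n[:class]['class_b']}.to_a[1..-1].each do |p_group|
    puts p_group.map(&:text).join(' ')
  end
end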
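
To see what the inner slice_before call is doing without a parsed document, here is a minimal sketch on plain Ruby data. The `Para` struct and the sample texts are hypothetical stand-ins for the p nodes, not part of Nokogiri; only `Enumerable#slice_before` is the real mechanism being shown.

```ruby
# Hypothetical stand-in for a parsed <p> node: its class attribute and its text.
Para = Struct.new(:klass, :text)

paragraphs = [
  Para.new('class_a class_b', 'Content starts in this paragraph...'),
  Para.new('class_a ',        '...but this content belongs to the preceding paragraph'),
  Para.new('class_a class_b', 'Second entry starts...'),
  Para.new('class_a ',        '...continues...'),
  Para.new('class_a ',        '...and ends.')
]

# Start a new slice at every paragraph whose class list contains class_b,
# then join the texts of each slice into one string.
entries = paragraphs
  .slice_before { |para| para.klass.split.include?('class_b') }
  .map { |group| group.map(&:text).join(' ') }

puts entries
```

Because a new slice starts only at a class_b paragraph, it does not matter whether an entry spans 2, 3, or 4 paragraphs.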

UPDATE

Another option using css:

doc.search('p.class_b').each do |p|
  str, next_node = p.text, p
  while next_node = next_node.at('+ p:not([class*=class_b])')
    str += " #{next_node.text}"
  end
  puts str
end

Upvotes: 3

Justin Ko

Reputation: 46846

If you do not mind using a combination of XPath and Nokogiri, you can do:

paragraph_text = []
doc.xpath('//p[preceding-sibling::h3[1][contains(text(), "Relevant")]]').each do |para|
  # A class_b paragraph starts a new entry; to_s covers a missing class attribute
  if para['class'].to_s.include?('class_b')
    paragraph_text << para.content
  else
    # Continuation paragraph: append its text to the current entry
    paragraph_text[-1] += para.text
  end
end
puts paragraph_text.inspect
#=> ["Content starts in this paragraph......but this content belongs to the preceding paragraph", "Content starts in this paragraph......but this content belongs to the preceding paragraph"]

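Basically, the XPath selects the paragraph tags that follow the relevant h3. Then plain Ruby iterates through those paragraphs and builds the strings: a class_b paragraph starts a new string, and every other paragraph is appended to the last one.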
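
The accumulation step can also be sketched on its own, separate from the XPath query. The version below is only an illustration under stated assumptions: `group_entries` and the `[class, text]` pairs are hypothetical names and stand-ins for the matched nodes. It joins the parts with a space, tolerates a missing class attribute, and skips any continuation paragraph that appears before the first class_b paragraph.

```ruby
# Hypothetical stand-ins for the matched <p> nodes: [class attribute or nil, text].
nodes = [
  [nil,               'Stray paragraph before the first entry'],
  ['class_a class_b', 'Entry one starts...'],
  ['class_a ',        '...and continues.'],
  ['class_a class_b', 'Entry two starts...'],
  ['class_a ',        '...continues...'],
  ['class_a ',        '...and ends.']
]

# Hypothetical helper: builds one string per entry.
def group_entries(nodes)
  nodes.each_with_object([]) do |(klass, text), entries|
    if klass.to_s.split.include?('class_b')
      # A class_b paragraph opens a new entry (dup so the source string is not mutated).
      entries << text.dup
    elsif entries.any?
      # A continuation paragraph is appended to the entry currently being built.
      entries.last << ' ' << text
    end
    # Paragraphs seen before any class_b paragraph are ignored.
  end
end

grouped = group_entries(nodes)
puts grouped
```

Whether stray leading paragraphs should be dropped, as here, or reported depends on the page being scraped.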

Upvotes: 4
