Denis T.

Reputation: 133

How to scrape non-standard HTML with Nokogiri

How can I parse non-standard HTML like this:

<body>
    <div class="open">
        <div style='style'>Raw name 1</div>
        <p>Text_1</p>
        <p>Text_2</p>
        <p>Text_3</p>
        <p>Text_4</p>
        <p>Text_5</p>         
        <div style='style'>Raw name 5</div>
        <p>Text_1</p>
        <p>Text_2</p>
        <p>Text_3</p>
        <p>Text_4</p>
        <p>Text_5</p>
    </div>
</body>

I want to get a result similar to:

['Raw name 1', 'Text_1', 'Text_2', 'Text_3', 'Text_4', 'Text_5']
...
['Raw name 5', 'Text_1', 'Text_2', 'Text_3', 'Text_4', 'Text_5']

I tried to adapt the example from "How to parse a HTML table with Nokogiri?", but it did not work.

Is it possible to extract the information I want from HTML like this?

Upvotes: 2

Views: 171

Answers (2)

the Tin Man

Reputation: 160551

I'd do something like:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<body>
    <div class="open">
        <div style='style'>Raw name 1</div>
        <p>Text_1</p>
        <p>Text_2</p>         
        <div style='style'>Raw name 5</div>
        <p>Text_1</p>
        <p>Text_2</p>
    </div>
</body>
EOT

doc.at('.open').elements.slice_before { |e| e.name == 'div' }.map { |ary|
  ary.map(&:text)
}
# => [["Raw name 1", "Text_1", "Text_2"], ["Raw name 5", "Text_1", "Text_2"]]

Breaking it down a bit:

doc.at('.open').elements.map(&:name) # => ["div", "p", "p", "div", "p", "p"]
doc.at('.open').elements.slice_before { |e| e.name == 'div' }.map { |a| a.map(&:name) } # => [["div", "p", "p"], ["div", "p", "p"]]

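elements and slice_before are the magic here: elements returns only the element children of the node (skipping the whitespace text nodes between tags), and slice_before starts a new group every time the block returns true, which here means at each div.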
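
To see the grouping step in isolation, here is a minimal sketch that uses plain Ruby arrays instead of Nokogiri nodes. The [tag name, text] pairs are a hypothetical stand-in for what doc.at('.open').elements would yield, not part of the original answer:

```ruby
# Hypothetical stand-in for the parsed elements: [tag name, text] pairs.
elements = [
  ['div', 'Raw name 1'], ['p', 'Text_1'], ['p', 'Text_2'],
  ['div', 'Raw name 5'], ['p', 'Text_1'], ['p', 'Text_2']
]

# slice_before begins a new chunk at every item for which the block is true,
# so each div opens a group that also holds the p entries following it.
groups = elements
  .slice_before { |pair| pair.first == 'div' }
  .map { |chunk| chunk.map(&:last) }

p groups
# => [["Raw name 1", "Text_1", "Text_2"], ["Raw name 5", "Text_1", "Text_2"]]
```

Because slice_before comes from Enumerable, the same call should work on any collection that includes Enumerable, as the Nokogiri node set does in the answer above.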

Upvotes: 3

engineersmnky

Reputation: 29328

If I understand correctly, this might work for you:

require 'nokogiri'
body = <<-BODY 
<body>
    <div class="open">
        <div style='style'>Raw name 1</div>
        <p>Text_1</p>
        <p>Text_2</p>
        <p>Text_3</p>
        <p>Text_4</p>
        <p>Text_5</p>         
        <div style='style'>Raw name 5</div>
        <p>Text_1</p>
        <p>Text_2</p>
        <p>Text_3</p>
        <p>Text_4</p>
        <p>Text_5</p>
    </div>
</body>   
BODY

doc = Nokogiri::HTML(body)
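doc.xpath('//body/div').children.each_with_object({}) do |node, obj|
  text = node.text.strip
  # Each div starts a new key holding an empty array of paragraphs.
  obj[text] = [] if node.name == 'div'
  # Each p is appended to the array of the most recently added div.
  obj[obj.keys.last] << text if node.name == 'p'
end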
#=> {"Raw name 1"=>["Text_1", "Text_2", "Text_3", "Text_4", "Text_5"], 
#     "Raw name 5"=>["Text_1", "Text_2", "Text_3", "Text_4", "Text_5"]}

Steps:

  • The XPath selects the div directly under body (doc.xpath('//body/div')).
  • Each child of that div (.children) is passed to the block together with an accumulator object, here an empty Hash (.each_with_object({}) do |node,obj|).
  • For each div tag, the block adds a key and assigns it an empty Array (obj[text] = [] if node.name == 'div').
  • For each p tag, it appends the text to the Array of the most recently added key (obj[obj.keys.last] << text if node.name == 'p'). Whitespace text nodes match neither condition and are skipped.

The result is a Hash in which each key is the text of a div and each value is an Array holding the text of the p tags that follow it, up to the next div.
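
If the array-per-row shape from the question is preferred, the Hash can be flattened afterwards. This is a small sketch, assuming a Hash shaped like the result above (shortened here to two paragraphs per key); the variable names result and rows are illustrative only:

```ruby
# Hash shaped like the output above, shortened for brevity.
result = {
  'Raw name 1' => ['Text_1', 'Text_2'],
  'Raw name 5' => ['Text_1', 'Text_2']
}

# Prepend each key to its array of values to get one flat row per div.
rows = result.map { |name, texts| [name, *texts] }

p rows
# => [["Raw name 1", "Text_1", "Text_2"], ["Raw name 5", "Text_1", "Text_2"]]
```

Since the Hash is keyed by the div text, this approach assumes that no two divs share the same text.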

Upvotes: 3
