How to scrape non-standard HTML with Nokogiri

Question

How can I parse non-standard HTML like this:


    
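    <body>
        <div>
            <div>Raw name 1</div>
            <p>Text_1</p>
            <p>Text_2</p>
            <p>Text_3</p>
            <p>Text_4</p>
            <p>Text_5</p>
            <div>Raw name 5</div>
            <p>Text_1</p>
            <p>Text_2</p>
            <p>Text_3</p>
            <p>Text_4</p>
            <p>Text_5</p>
        </div>
    </body>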

I want to get a result similar to:

['Raw name 1', Text_1, Text_2, Text_3, Text_4, Text_5]
...
['Raw name 5', Text_1, Text_2, Text_3, Text_4, Text_5]

I tried to adapt the approach from "How to parse a HTML table with Nokogiri?", but it did not produce any results.

Is it possible to extract the information I want from HTML like this?

engineersmnky · Accepted Answer

If I understand correctly, this might work for you:

require 'nokogiri'
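body = <<-BODY
<body>
    <div>
        <div>Raw name 1</div>
        <p>Text_1</p>
        <p>Text_2</p>
        <p>Text_3</p>
        <p>Text_4</p>
        <p>Text_5</p>
        <div>Raw name 5</div>
        <p>Text_1</p>
        <p>Text_2</p>
        <p>Text_3</p>
        <p>Text_4</p>
        <p>Text_5</p>
    </div>
</body>
BODY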

doc = Nokogiri::HTML(body)
doc.xpath('//body/div').children.each_with_object({}) do |node,obj|
    text = node.text.strip
    obj[text] = [] if node.name == 'div'
    obj[obj.keys.last] << text if node.name == 'p'
end
#=> {"Raw name 1"=>["Text_1", "Text_2", "Text_3", "Text_4", "Text_5"], 
#     "Raw name 5"=>["Text_1", "Text_2", "Text_3", "Text_4", "Text_5"]}

Steps:

doc.xpath('//body/div') selects the div directly under body that wraps all of the content.
.children.each_with_object({}) do |node,obj| then passes each child of that div to the block, together with a Hash (obj) that serves as the accumulator.
For each div child, it adds that div's text as a key and assigns it an empty Array (obj[text] = [] if node.name == 'div').
For each p child, it appends the text to the Array under the most recently added key (obj[obj.keys.last] << text if node.name == 'p').
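
The accumulator steps above can be traced without Nokogiri. The sketch below is only an illustration: Node is a hypothetical Struct standing in for a Nokogiri node (it provides just name and text), and the sample data is shortened. It also skips any p that appears before the first div, a case the original one-liner would raise on.

```ruby
# Hypothetical stand-in for a Nokogiri node: only the two methods the block uses.
Node = Struct.new(:name, :text)

nodes = [
  Node.new('text', "\n    "),      # whitespace between tags arrives as a text node
  Node.new('div',  'Raw name 1'),
  Node.new('p',    ' Text_1 '),
  Node.new('p',    'Text_2'),
  Node.new('div',  'Raw name 5'),
  Node.new('p',    'Text_1')
]

grouped = nodes.each_with_object({}) do |node, obj|
  text = node.text.strip
  case node.name
  when 'div'
    obj[text] = []                 # a div starts a new group
  when 'p'
    key = obj.keys.last            # most recently added div, if any
    obj[key] << text if key        # ignore a p that precedes every div
  end
end

p grouped
```

Nodes whose name is neither 'div' nor 'p', such as the whitespace text node, fall through the case statement and are ignored.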

The result is a Hash whose keys are the text of each div and whose values are Arrays holding the text of the p tags that follow, up to the next div.
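
The question asked for one array per name rather than a Hash. A minimal sketch of that conversion, assuming the Hash has been stored in a variable (the names result and rows are hypothetical, and the Hash is written out literally here so the snippet runs on its own):

```ruby
# The Hash as returned by the each_with_object call, written out literally.
result = {
  'Raw name 1' => %w[Text_1 Text_2 Text_3 Text_4 Text_5],
  'Raw name 5' => %w[Text_1 Text_2 Text_3 Text_4 Text_5]
}

# Put each key in front of its values to get one flat array per name.
rows = result.map { |name, texts| [name, *texts] }

rows.each { |row| p row }
```

Because Ruby Hashes preserve insertion order, the rows come out in the same order as the divs appear in the document.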
