Parse HTML (without HTML semantics being followed) using Nokogiri

Question

I have an HTML document containing data:

While parsing, I use:

div_node.children.each do |child|
  if child.node_name == 'p'
  # store it as an HTML string in the DB
    store(child.to_html)
  end
end

When I check the database, I get only the outer <p> tag, with no inner tag.

I know that, under HTML rules, the <p> tag cannot contain the inner tag.
ashishmohite · Accepted Answer

I ended up using the Nokogiri::XML parser to parse the HTML document.

I had to change my script in numerous places.

Parsing code

@xml_doc = Nokogiri::XML.parse(file) { |cfg| cfg.noblanks }
@xml_doc.remove_namespaces!

Changes made

I changed calls to the attribute method to attr.
Chaining attr with the text method is not needed here, since attr already returns a string.
I still need to check how invalid HTML5 tags are handled, though.
Some more changes to the parsing logic were needed.
node.to_html works like a charm here, so I was able to store the complete HTML in the DB.
