Reputation: 1120
I have an HTML document containing data:
<div>
<p class="someclass">
<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>
</p>
</div>
While parsing, I use:
div_node.children.each do |child|
  if child.node_name == 'p'
    # store it as an HTML string in the db
    store(child.to_html)
  end
end
When I check the database, I get only the outer <p> tag:
<p class="someclass">
</p>
No inner <ul> tag content is stored or retrieved.
I know that a <p> tag cannot contain a <ul> tag, but the documents we got from the client have this markup, and there are about 1000 of them, so I cannot edit them manually.
Upvotes: 2
Views: 554
Reputation: 1120
I ended up using the Nokogiri::XML parser to parse the HTML doc.
I had to change my script in numerous places.
Parsing code
@xml_doc = Nokogiri::XML.parse(file) { |cfg| cfg.noblanks }
@xml_doc.remove_namespaces!
Changes done:
Changed the attribute method to attr.
attr with the text method is not needed here.
node.to_html works like a charm here, so I was able to store the complete HTML in the db.
Upvotes: 1
Reputation: 6122
Try using the Nokogiri::XML parser instead of the Nokogiri::HTML one. It shouldn't care about tag semantics, but I'm not sure how it will handle those parts of HTML5 which are not valid XML.
Upvotes: 1