Reputation: 29777
I have this string:
<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE en-note SYSTEM \"http://xml.evernote.com/pub/enml.dtd\">\n\n\n<en-note>\n<font size=\"5\">text_part_1</font><br><br>\n<font size=\"5\">text_part_2</font><br><br>\n<font size=\"5\">text_part_3</font>
I need to extract the text content but also keep the <br> elements, so the result would be:
text_part_1<br><br>text_part_2<br><br>text_part_3
How can I use Nokogiri to do this?
Upvotes: 0
Views: 441
Reputation: 160551
Part of the problem is that your XML is malformed. <br> is unterminated; in XML it should be self-closing, i.e., <br/>, or be followed by an end tag, i.e., </br>.
As a result, Nokogiri records errors while parsing the XML. If you check the document's errors method after parsing, you'll see something like:
[
#<Nokogiri::XML::SyntaxError: Premature end of data in tag br line 7>,
#<Nokogiri::XML::SyntaxError: Premature end of data in tag br line 7>,
#<Nokogiri::XML::SyntaxError: Premature end of data in tag br line 6>,
#<Nokogiri::XML::SyntaxError: Premature end of data in tag br line 6>,
#<Nokogiri::XML::SyntaxError: Premature end of data in tag en-note line 5>
]
Fix that, and Nokogiri will be able to process the XML correctly. At that point, you'll be able to do something simple like:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml.dtd">
<en-note>
<font size="5">text_part_1</font><br/><br/>
<font size="5">text_part_2</font><br/><br/>
<font size="5">text_part_3</font>
</en-note>
EOT
# Swap each <br/> for a placeholder so it survives text extraction.
doc.search('br').each do |br|
br.replace('##br##')
end
# Extract the text, then turn the placeholders back into tags.
text = doc.content.gsub('##br##', '<br/>')
puts text
Here's the output with the corrected br tags:
text_part_1<br/><br/>
text_part_2<br/><br/>
text_part_3
The simplest way to fix the XML is to run some cleanup code before parsing it, like:
doc = Nokogiri::XML(xml.gsub('<br>', '<br/>'))
where xml is the variable containing your XML content.
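If the input might also contain variants such as `<BR>` or `<br >`, a regular expression can cover them in one pass. This is only a sketch: `close_br_tags` is a hypothetical helper name, and it assumes the `<br>` tags carry no attributes.

```ruby
# Hypothetical helper: rewrite unterminated <br> tags as self-closing <br/>.
# Matches "<br>", "<BR>" and "<br >", and leaves existing "<br/>" untouched.
# Assumes the tags have no attributes.
def close_br_tags(xml)
  xml.gsub(/<br\s*>/i, '<br/>')
end

sample = "a<br>b<BR>c<br >d<br/>e"
puts close_br_tags(sample)
# prints: a<br/>b<br/>c<br/>d<br/>e
```

The cleaned string can then be passed to Nokogiri::XML in place of the raw input.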
Upvotes: 1
Reputation: 27374
How about:
require 'nokogiri'
html = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<!DOCTYPE en-note SYSTEM \"http://xml.evernote.com/pub/enml.dtd\">\n\n\n<en-note>\n<font size=\"5\">text_part_1</font><br><br>\n<font size=\"5\">text_part_2</font><br><br>\n<font size=\"5\">text_part_3</font>"
# The HTML parser tolerates unterminated <br> tags.
doc = Nokogiri::HTML(html)
str = ""
# Collect text nodes and <br> elements in document order.
doc.traverse { |n| str << n.to_s if n.text? || n.name == "br" }
str #=> "text_part_1<br><br>text_part_2<br><br>text_part_3"
Upvotes: 0