Reputation: 16734
I have a string which has plain text and extra spaces and carriage returns then XML-like tags followed by XML tags:
String = "hi there.
<SET-TOPIC> INITIATE </SET-TOPIC>
<SETPROFILE>
<KEY>name</KEY>
<VALUE>Joe</VALUE>
</SETPROFILE>
<SETPROFILE>
<KEY>email</KEY>
<VALUE>[email protected]</VALUE>
</SETPROFILE>
<GET-RELATIONS>
<COLLECTION>goals</COLLECTION>
<VALUE>walk upstairs</VALUE>
</GET-RELATIONS>
So what do you think?
Is it true?
"
I want to parse this similar to use Nori or Nokogiri or Ox where they convert XML to a hash.
My goal is to be able to easily pull out the top level tags as keys and then know all the elements, something like:
Keys = ['SETPROFILE', 'SETPROFILE', 'SET-TOPIC', 'GET-OBJECT']
Values[0] = [{name => Joe}, {email => [email protected]}]
Values[3] = [{collection => goals}, {value => walk up}]
I have seen several functions like that for true XML but all of mine are partial.
I started going down this line of thinking:
parsed = doc.search('*').each_with_object({}) do |n, h|
(h[n.name] ||= []) << n.text
end
Upvotes: 0
Views: 668
Reputation: 160551
I'd probably do something along these lines if I wanted the keys
and values
variables:
require 'nokogiri'
string = "hi there.
<SET-TOPIC> INITIATE </SET-TOPIC>
<SETPROFILE>
<KEY>name</KEY>
<VALUE>Joe</VALUE>
</SETPROFILE>
<SETPROFILE>
<KEY>email</KEY>
<VALUE>[email protected]</VALUE>
</SETPROFILE>
<GET-RELATIONS>
<COLLECTION>goals</COLLECTION>
<VALUE>walk upstairs</VALUE>
</GET-RELATIONS>
So what do you think?
Is it true?
"
doc = Nokogiri::XML('<root>' + string + '</root>', nil, nil, Nokogiri::XML::ParseOptions::NOBLANKS)
nodes = doc.root.children.reject { |n| n.is_a?(Nokogiri::XML::Text) }.map { |node|
[
node.name, node.children.map { |c|
[c.name, c.content]
}.to_h
]
}
nodes
# => [["SET-TOPIC", {"text"=>" INITIATE "}],
# ["SETPROFILE", {"KEY"=>"name", "VALUE"=>"Joe"}],
# ["SETPROFILE", {"KEY"=>"email", "VALUE"=>"[email protected]"}],
# ["GET-RELATIONS", {"COLLECTION"=>"goals", "VALUE"=>"walk upstairs"}]]
From nodes
it's possible to grab the rest of the detail:
keys = nodes.map(&:first)
# => ["SET-TOPIC", "SETPROFILE", "SETPROFILE", "GET-RELATIONS"]
values = nodes.map(&:last)
# => [{"text"=>" INITIATE "},
# {"KEY"=>"name", "VALUE"=>"Joe"},
# {"KEY"=>"email", "VALUE"=>"[email protected]"},
# {"COLLECTION"=>"goals", "VALUE"=>"walk upstairs"}]
values[0] # => {"text"=>" INITIATE "}
If you'd rather, it's possible to pre-process the DOM and remove the top-level text:
doc.root.children.select { |n| n.is_a?(Nokogiri::XML::Text) }.map(&:remove)
doc.to_xml
# => "<root><SET-TOPIC> INITIATE </SET-TOPIC><SETPROFILE><KEY>name</KEY><VALUE>Joe</VALUE></SETPROFILE><SETPROFILE><KEY>email</KEY><VALUE>[email protected]</VALUE></SETPROFILE><GET-RELATIONS><COLLECTION>goals</COLLECTION><VALUE>walk upstairs</VALUE></GET-RELATIONS></root>\n"
That makes it easier to work with the XML.
Upvotes: 1
Reputation: 14082
Wrap the string content in a node and you can parse that with Nokogiri. The text outside the XML segment will be text node in the new node.
str = "hi there. .... Is it true?"
doc = Nokogiri::XML("<wrapper>#{str}</wrapper>")
segments = doc.xpath('/*/SETPROFILE')
Now you can use "Convert a Nokogiri document to a Ruby Hash" to convert the segments into a hash.
However, if the plain text contains some characters that needs to be escaped in the XML spec you'll need to find those and escape them yourself.
Upvotes: 0