Reputation: 2601
I have an xml document as follows:
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
On hive, I am able to use xpath
to retrieve the text of each node in the XML document using the following command:
select xpath(xml_text,'//*[name()='note']//text()') from table_test;
However, I cannot figure out which XPATH command to use on Hive to retrieve the XML node names of the text
For the example above, I want to be able to return
["to","from","heading","body"]
, which represent the XML nodes for the tokens in the XML files.
Any help is appreciated.
Upvotes: 2
Views: 1161
Reputation: 1491
In general, you can only use xpath
to get either text or element attributes - not names of nodes.
So, there are two options: You could write(or find online) a custom UDF that returns the name of a node for a given xpath.
Or, you could use this hack:
select xpath(regexp_replace(xml_text,'<([\\w]+),'<$1 nodename=\'$1\' '),note/*/@nodename)
Explanation: it adds the nodename attribute to any xml element. (<to> -> <to nodename='to'>
) It may also add it in some text, but since you only extract the nodename
attribute - it shouldn't matter.
By the way, you could rewrite your text query to:
select xpath(xml_text,'note/*/text()') from table_test;
Upvotes: 2