Don Gorgon
Don Gorgon

Reputation: 2601

Get the names of XML nodes using XPATH on Hive

I have an xml document as follows:

<note>
   <to>Tove</to>
   <from>Jani</from>
   <heading>Reminder</heading>
   <body>Don't forget me this weekend!</body>
</note>

On hive, I am able to use xpath to retrieve the text of each node in the XML document using the following command:

select xpath(xml_text,'//*[name()='note']//text()') from table_test;

However, I cannot figure out which XPATH command to use on Hive to retrieve the XML node names of the text For the example above, I want to be able to return ["to","from","heading","body"], which represent the XML nodes for the tokens in the XML files. Any help is appreciated.

Upvotes: 2

Views: 1161

Answers (1)

Alex Libov
Alex Libov

Reputation: 1491

In general, you can only use xpath to get either text or element attributes - not names of nodes.

So, there are two options: You could write(or find online) a custom UDF that returns the name of a node for a given xpath.

Or, you could use this hack:

select xpath(regexp_replace(xml_text,'<([\\w]+),'<$1 nodename=\'$1\' '),note/*/@nodename)

Explanation: it adds the nodename attribute to any xml element. (<to> -> <to nodename='to'>) It may also add it in some text, but since you only extract the nodename attribute - it shouldn't matter.

By the way, you could rewrite your text query to:

select xpath(xml_text,'note/*/text()') from table_test;

Upvotes: 2

Related Questions