Reputation: 215
I have a XML blob (as shown below) store in a hive log table.
<user>
<uid>1424324325</uid>
<attribs>
<field>
...
</field>
<field>
<name>first</name>
<value>John</value>
</field>
<field>
...
</field>
<field>
<name>last</name>
<value>Doe</value>
</field>
<field>
...
</field>
</attribs>
</user>
Each row in the hive table would have information about different users and I want to extract the values of uid, first name and last name.
1424324325 John Doe
1424435463 Jane Smith
Extracting uid value is pretty simple. However, I am getting stumped trying to extract the first name & last name. The problem is about identifying the first name & last name pair and extracting the value.
I tried to extract the first name & last name as shown below but I got an error saying this is an invalid expression.
SELECT uid, fn, ln
FROM log_table
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/uid/text()')) uids as uid
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field/name/text()="first"/../value/text()')) fns as fn
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field/name/text()="last"/../value/text()')) lns as ln;
I thought of using a hard-coded expression for the field node as shown below but the problem is that different records will have the first name and last name values in different locations.
LATERAL VIEW explode(xpath(logs['users_updates'], '/user/attribs/field[5]/value/text()')) fns as fn
When I try to extract first name as shown below, the result is empty.
LATERAL VIEW explode(xpath(logs['users_updates'], '/users/attribs/field/name/[text()="last"]/following-sibling::value[1]/text()')) fns as fn
How can I extract the information I want like below?
1424324325 John Doe
1424435463 Jane Smith
Thanks in advance.
Upvotes: 0
Views: 319
Reputation: 6218
The following XPath should give you the correct result. Your syntax was incorrect (a predicate (i.e. everything in brackets) needs to be with an element, but you just did a child step using /
).
/users/attribs/field[name = "first"]/value/string()
Furthermore, some points to improve your query:
text()
when comparing node values, it will be done automatically and it is quite likely fasterstring()
over text()
Upvotes: 1