Reputation: 189
I'm trying to create an Hive table out of XML file using com.ibm.spss.hive.serde2.xml.XmlSerDe. This works quite good for the tags with single occurrence. But I've an issue with multiple occurrence.
Below is my source XML.
<Item>
<TimeStamp>2016-02-19T12:27:06.387Z</TimeStamp>
<AlsoSeen End="2014-08-21T13:44:32.557Z" Start="2014-08-21T13:44:04.637Z" />
<AlsoSeen End="2014-08-21T13:44:33.557Z" Start="2014-08-21T13:45:04.637Z" />
<AlsoSeen End="2014-08-21T13:44:34.557Z" Start="2014-08-21T13:46:04.637Z" />
<Title ID="112031424">FAULT IN OUR STARS, THE</Title>
<FileName>The Fault in Our Stars (2014) EXTENDED HDRip x264 AAC-CPG</FileName>
</Item>
Below is my Hive Table DDL
add jar hivexmlserde-1.0.5.3.jar;
CREATE EXTERNAL TABLE xml_test
(
Item_TimeStamp String
,Item_AS_Start String
,Item_AS_End String
,Item_Title String
,Item_ID String
,Item_Artist String
,Item_Author String
,Item_FileName String
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.Item_TimeStamp"="/Item/TimeStamp/text()",
"column.xpath.Item_AS_Start"="/Item/AlsoSeen/@Start",
"column.xpath.Item_AS_End"="/Item/AlsoSeen/@End",
"column.xpath.Item_Title"="/Item/Title/text()",
"column.xpath.Item_ID"="/Item/Title/@ID",
"column.xpath.Item_FileName"="/Item/FileName/text()"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/xxxxxx/XML_text'
TBLPROPERTIES (
"xmlinput.start"="<Item>",
"xmlinput.end"="</Item>"
);
When I query the table I get the below value
2016-02-19T12:27:06.387Z <string>2014-08-21T13:44:04.637Z2014-08-21T13:45:04.637Z2014-08-21T13:46:04.637Z</string> <string>2014-08-21T13:44:32.557Z2014-08-21T13:44:33.557Z2014-08-21T13:44:34.557Z</string> FAULT IN OUR STARS, THE 112031424 The Fault in Our Stars (2014) EXTENDED HDRip x264 AAC-CPG
But I expected output should be 3 different lines for 3 difference occurances of AlsoSeen tag as below
2016-02-19T12:27:06.387Z 2014-08-21T13:44:04.637Z 2014-08-21T13:44:32.557Z FAULT IN OUR STARS, THE 112031424 The Fault in Our Stars (2014) EXTENDED HDRip x264 AAC-CPG
2016-02-19T12:27:06.387Z 2014-08-21T13:45:04.637Z 2014-08-21T13:44:33.557Z FAULT IN OUR STARS, THE 112031424 The Fault in Our Stars (2014) EXTENDED HDRip x264 AAC-CPG
2016-02-19T12:27:06.387Z 2014-08-21T13:46:04.637Z 2014-08-21T13:44:34.557Z FAULT IN OUR STARS, THE 112031424 The Fault in Our Stars (2014) EXTENDED HDRip x264 AAC-CPG
Can anyone please help me?
Upvotes: 0
Views: 1052
Reputation: 5415
The documentation has some examples for this. https://github.com/dvasilen/Hive-XML-SerDe/wiki/XML-data-sources
Using text()
or @Elementname
gives you single (primitive) values, and also, your table has been declared as all STRING
columns, you may need complex types to hold multiple values, i.e. Map, Struct or Array.
The behaviour you are seeing relates to (5. Complex Types) in the documentation in the link above.
"Complex content being used as a primitive type will be converted to a valid XML string by adding a root element called
<string>
"
In your example, this result you are seeing is the 3 values for the @Start attribute concatenated and then wrapped with the <string>
tag:
<string>2014-08-21T13:44:04.637Z2014-08-21T13:45:04.637Z2014-08-21T13:46:04.637Z</string>
Your XPath is using @Start
and @End
, not the /*
pattern though, so I am surprised it worked at all.
I understand the results you want, but I am not sure that it is really matching the way the data is modeled. I think that <Item>
is effectively a single row. Item has a <TimeStamp>
, and a <Title>
which has an @ID attribute, and it also has a property called <AlsoSeen>
which can have multiple values for @End & @Start. A straight mapping from XML to Hive table would probably be this complex datatype:
ARRAY<STRUCT<End: TIMESTAMP, Start: TIMESTAMP>>
, an Array of Structs containing End and Start timestamps.
Unfortunately your data doesn't conform to the String format that Hive expects for timestamps, so you can just use STRING, as per https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-timestamp
So you would end up with this table create statement (I've removed the Item_Artist and Item_Author in your original create table because you had no matching xpath for those columns in the SERDEPROPERTIES )
CREATE EXTERNAL TABLE xml_test
(
Item_TimeStamp String
,Item_AlsoSeen ARRAY<STRUCT<End: STRING, Start: STRING>>
,Item_Title String
,Item_ID String
,Item_FileName String
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.Item_TimeStamp"="/Item/TimeStamp/text()",
"column.xpath.Item_AlsoSeen"="/Item/AlsoSeen",
"column.xpath.Item_Title"="/Item/Title/text()",
"column.xpath.Item_ID"="/Item/Title/@ID",
"column.xpath.Item_FileName"="/Item/FileName/text()"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/xxxxxx/XML_text'
TBLPROPERTIES (
"xmlinput.start"="<Item>",
"xmlinput.end"="</Item>"
);
Note the xpath I used for the AlsoSeen field: "/Item/AlsoSeen"
This is the pattern for Arrays and Structs as in the documentation linked above.
It will give you results like
+--------------------+--------------------+--------------------+---------+--------------------+
| Item_TimeStamp| Item_AlsoSeen| Item_Title| Item_ID| Item_FileName|
+--------------------+--------------------+--------------------+---------+--------------------+
|2016-02-19T12:27:...|[[2014-08-21T13:4...|FAULT IN OUR STAR...|112031424|The Fault in Our ...|
+--------------------+--------------------+--------------------+---------+--------------------+
The Item_AlsoSeen column contains this, an Array () of Structs [] separated by commas, each struct has the end and the start.
WrappedArray([2014-08-21T13:44:32.557Z,2014-08-21T13:44:04.637Z], [2014-08-21T13:44:33.557Z,2014-08-21T13:45:04.637Z], [2014-08-21T13:44:34.557Z,2014-08-21T13:46:04.637Z])
From there, you can query the result set you wanted by using the LATERAL VIEW explode()
technique
e.g.
SELECT
Item_TimeStamp,
--Item_AlsoSeen,
ias.End,
ias.Start,
Item_Title,
Item_ID,
Item_FileName
FROM xml_test
LATERAL VIEW explode(Item_AlsoSeen) ias as ias
Note that because I created AlsoSeen as an Array of Struct, the explode()
function returns each struct as a row. LATERAL VIEW
then effectively does a cross join or probably better called a CROSS APPLY
to get a cartesian product.
+--------------------+--------------------+--------------------+--------------------+---------+--------------------+
| Item_TimeStamp| End| Start| Item_Title| Item_ID| Item_FileName|
+--------------------+--------------------+--------------------+--------------------+---------+--------------------+
|2016-02-19T12:27:...|2014-08-21T13:44:...|2014-08-21T13:44:...|FAULT IN OUR STAR...|112031424|The Fault in Our ...|
|2016-02-19T12:27:...|2014-08-21T13:44:...|2014-08-21T13:45:...|FAULT IN OUR STAR...|112031424|The Fault in Our ...|
|2016-02-19T12:27:...|2014-08-21T13:44:...|2014-08-21T13:46:...|FAULT IN OUR STAR...|112031424|The Fault in Our ...|
+--------------------+--------------------+--------------------+--------------------+---------+--------------------+
Ref Hive Datatypes: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types
Upvotes: 1
Reputation: 189
There are custom written UDF's which are array_index , numeric_range solves the problem easily. In order to use the functions the column type should be an Array. Please refer the post Hive Explode / Lateral View multiple arrays
Upvotes: 0