Reputation: 189

Multiple occurance issue in XML Parsing using hive

I'm trying to create an Hive table out of XML file using com.ibm.spss.hive.serde2.xml.XmlSerDe. This works quite good for the tags with single occurrence. But I've an issue with multiple occurrence.

Below is my source XML.

<Item>
      <TimeStamp>2016-02-19T12:27:06.387Z</TimeStamp>
      <AlsoSeen End="2014-08-21T13:44:32.557Z" Start="2014-08-21T13:44:04.637Z" />
      <AlsoSeen End="2014-08-21T13:44:33.557Z" Start="2014-08-21T13:45:04.637Z" />
      <AlsoSeen End="2014-08-21T13:44:34.557Z" Start="2014-08-21T13:46:04.637Z" />
      <Title ID="112031424">FAULT IN OUR STARS, THE</Title>
      <FileName>The Fault in Our Stars (2014) EXTENDED HDRip x264 AAC-CPG</FileName>
</Item>

Below is my Hive Table DDL

add jar hivexmlserde-1.0.5.3.jar;

CREATE EXTERNAL TABLE xml_test
(
         Item_TimeStamp String
         ,Item_AS_Start  String
         ,Item_AS_End    String
         ,Item_Title     String
         ,Item_ID        String
         ,Item_Artist    String
         ,Item_Author    String
         ,Item_FileName  String
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.Item_TimeStamp"="/Item/TimeStamp/text()",
"column.xpath.Item_AS_Start"="/Item/AlsoSeen/@Start",
"column.xpath.Item_AS_End"="/Item/AlsoSeen/@End",  
"column.xpath.Item_Title"="/Item/Title/text()",   
"column.xpath.Item_ID"="/Item/Title/@ID",      
"column.xpath.Item_FileName"="/Item/FileName/text()"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/xxxxxx/XML_text'
TBLPROPERTIES (
"xmlinput.start"="<Item>",
"xmlinput.end"="</Item>"
);

When I query the table I get the below value

2016-02-19T12:27:06.387Z <string>2014-08-21T13:44:04.637Z2014-08-21T13:45:04.637Z2014-08-21T13:46:04.637Z</string> <string>2014-08-21T13:44:32.557Z2014-08-21T13:44:33.557Z2014-08-21T13:44:34.557Z</string> FAULT IN OUR STARS, THE 112031424 The Fault in Our Stars (2014) EXTENDED HDRip x264 AAC-CPG

But I expected output should be 3 different lines for 3 difference occurances of AlsoSeen tag as below

2016-02-19T12:27:06.387Z  2014-08-21T13:44:04.637Z 2014-08-21T13:44:32.557Z FAULT IN OUR STARS, THE 112031424 The Fault in Our Stars (2014) EXTENDED HDRip x264 AAC-CPG
2016-02-19T12:27:06.387Z  2014-08-21T13:45:04.637Z 2014-08-21T13:44:33.557Z FAULT IN OUR STARS, THE 112031424 The Fault in Our Stars (2014) EXTENDED HDRip x264 AAC-CPG  
2016-02-19T12:27:06.387Z  2014-08-21T13:46:04.637Z 2014-08-21T13:44:34.557Z FAULT IN OUR STARS, THE 112031424 The Fault in Our Stars (2014) EXTENDED HDRip x264 AAC-CPG

Can anyone please help me?

Upvotes: 0

Answers (2)

Davos

Reputation: 5415

The documentation has some examples for this. https://github.com/dvasilen/Hive-XML-SerDe/wiki/XML-data-sources

Using text() or @Elementname gives you single (primitive) values, and also, your table has been declared as all STRING columns, you may need complex types to hold multiple values, i.e. Map, Struct or Array.

The behaviour you are seeing relates to (5. Complex Types) in the documentation in the link above.

"Complex content being used as a primitive type will be converted to a valid XML string by adding a root element called <string>"

In your example, this result you are seeing is the 3 values for the @Start attribute concatenated and then wrapped with the <string> tag:

<string>2014-08-21T13:44:04.637Z2014-08-21T13:45:04.637Z2014-08-21T13:46:04.637Z</string>

Your XPath is using @Start and @End, not the /* pattern though, so I am surprised it worked at all.

I understand the results you want, but I am not sure that it is really matching the way the data is modeled. I think that <Item> is effectively a single row. Item has a <TimeStamp>, and a <Title> which has an @ID attribute, and it also has a property called <AlsoSeen> which can have multiple values for @End & @Start. A straight mapping from XML to Hive table would probably be this complex datatype: ARRAY<STRUCT<End: TIMESTAMP, Start: TIMESTAMP>> , an Array of Structs containing End and Start timestamps.

Unfortunately your data doesn't conform to the String format that Hive expects for timestamps, so you can just use STRING, as per https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-timestamp

So you would end up with this table create statement (I've removed the Item_Artist and Item_Author in your original create table because you had no matching xpath for those columns in the SERDEPROPERTIES )

CREATE EXTERNAL TABLE xml_test
(
         Item_TimeStamp String
         ,Item_AlsoSeen ARRAY<STRUCT<End: STRING, Start: STRING>>
         ,Item_Title     String
         ,Item_ID        String
         ,Item_FileName  String
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.Item_TimeStamp"="/Item/TimeStamp/text()",
"column.xpath.Item_AlsoSeen"="/Item/AlsoSeen", 
"column.xpath.Item_Title"="/Item/Title/text()",   
"column.xpath.Item_ID"="/Item/Title/@ID",      
"column.xpath.Item_FileName"="/Item/FileName/text()"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/xxxxxx/XML_text'
TBLPROPERTIES (
"xmlinput.start"="<Item>",
"xmlinput.end"="</Item>"
);

Note the xpath I used for the AlsoSeen field: "/Item/AlsoSeen"

This is the pattern for Arrays and Structs as in the documentation linked above.

It will give you results like

+--------------------+--------------------+--------------------+---------+--------------------+
|      Item_TimeStamp|       Item_AlsoSeen|          Item_Title|  Item_ID|       Item_FileName|
+--------------------+--------------------+--------------------+---------+--------------------+
|2016-02-19T12:27:...|[[2014-08-21T13:4...|FAULT IN OUR STAR...|112031424|The Fault in Our ...|
+--------------------+--------------------+--------------------+---------+--------------------+

The Item_AlsoSeen column contains this, an Array () of Structs [] separated by commas, each struct has the end and the start.

WrappedArray([2014-08-21T13:44:32.557Z,2014-08-21T13:44:04.637Z], [2014-08-21T13:44:33.557Z,2014-08-21T13:45:04.637Z], [2014-08-21T13:44:34.557Z,2014-08-21T13:46:04.637Z])

From there, you can query the result set you wanted by using the LATERAL VIEW explode() technique

e.g.

SELECT
Item_TimeStamp,
    --Item_AlsoSeen,
    ias.End,
    ias.Start,
    Item_Title, 
    Item_ID, 
    Item_FileName
FROM xml_test
    LATERAL VIEW explode(Item_AlsoSeen) ias as ias

Note that because I created AlsoSeen as an Array of Struct, the explode() function returns each struct as a row. LATERAL VIEW then effectively does a cross join or probably better called a CROSS APPLY to get a cartesian product.

+--------------------+--------------------+--------------------+--------------------+---------+--------------------+
|      Item_TimeStamp|                 End|               Start|          Item_Title|  Item_ID|       Item_FileName|
+--------------------+--------------------+--------------------+--------------------+---------+--------------------+
|2016-02-19T12:27:...|2014-08-21T13:44:...|2014-08-21T13:44:...|FAULT IN OUR STAR...|112031424|The Fault in Our ...|
|2016-02-19T12:27:...|2014-08-21T13:44:...|2014-08-21T13:45:...|FAULT IN OUR STAR...|112031424|The Fault in Our ...|
|2016-02-19T12:27:...|2014-08-21T13:44:...|2014-08-21T13:46:...|FAULT IN OUR STAR...|112031424|The Fault in Our ...|
+--------------------+--------------------+--------------------+--------------------+---------+--------------------+

Ref Hive Datatypes: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types

Upvotes: 1

Sathyaraj

Reputation: 189

There are custom written UDF's which are array_index , numeric_range solves the problem easily. In order to use the functions the column type should be an Array. Please refer the post Hive Explode / Lateral View multiple arrays

Upvotes: 0

Multiple occurance issue in XML Parsing using hive

Answers (2)

Related Questions