Nandakumar_bigdata
Nandakumar_bigdata

Reputation: 5

Loading XML file to Hive tables

We are working on loading a semi-structured XML file in Hive tables. These are retail purchase data.
I have attached a sample XML file to understand how the data looks and also the Hive table definition which i am using to read that XML file.

The file has multiple Payment and Amount(Amt) sections under "Tender" for each Basket in the XML. For example a customer can pay by Cash,EFTPOS,Credit card,Loyalty card or sometimes use a combination of these.

While reading the data from the attached XML , the results show the Payment column merged with entries in the XML and for 'Amt' it shows NULL.

I understood that the table which I am using to read the data is NOT 100% correct.
Please let me know how i can create multiple records for same element names "Payment" and "Amt"using the xpath functions

I tried to find some documentation in internet for the same and couldn't find anything which is similar to my scenario.

Sample data:

<Bskt>
  <TillNo>4</TillNo>
  <BsktNo>1747</BsktNo>
  <DateTime>2017-10-31T10:51:25.000+11:00</DateTime>
  <OpID>10115</OpID>
  <Tender>
    <PayType>CSH</PayType>
    <Amt>46.75</Amt>
  </Tender>
  <Tender>
    <PayType>ITMLOY</PayType>
    <Amt>0</Amt>
    <CardNo>2679911927</CardNo>
    <Program>SmartRewards</Program>
    <Earn>46.00</Earn>
    <Burn>0.00</Burn>
  </Tender>
</Bskt>
<Bskt>
  <TillNo>4</TillNo>
  <BsktNo>1748</BsktNo>
  <DateTime>2017-10-31T10:53:11.000+11:00</DateTime>
  <OpID>10115</OpID>
  <Tender>
    <PayType>CSH</PayType>
    <Amt>46.75</Amt>
  </Tender>
  <Tender>
    <PayType>ITMLOY</PayType>
    <Amt>0</Amt>
    <CardNo>2619183833</CardNo>
    <Program>SmartRewards</Program>
    <Earn>46.00</Earn>
    <Burn>0.00</Burn>
  </Tender>
</Bskt>
<Bskt>
  <TillNo>4</TillNo>
  <BsktNo>1753</BsktNo>
  <DateTime>2017-10-31T11:19:34.000+11:00</DateTime>
  <OpID>50056</OpID>
  <Tender>
    <PayType>CSH</PayType>
    <Amt>28.10</Amt>
  </Tender>
  <Tender>
    <PayType>ITMLOY</PayType>
    <Amt>0</Amt>
    <CardNo>8263734549</CardNo>
    <Program>SmartRewards</Program>
    <Earn>28.00</Earn>
    <Burn>0.00</Burn>
  </Tender>
</Bskt>

Hive table:

CREATE EXTERNAL TABLE BASKET_TENDER (
`DateTime` string,
`BsktNo` double,
`TillNo` int,
`PayType` string,
`Amt` float,
`CardNo` string,
`Program` string,
`Earn` float,
`Burn` float
)

ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (

"column.xpath.DateTime"="/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Bskt/TillNo/text()",
"column.xpath.PayType"="/Bskt/Tender/Paytype/text()",
"column.xpath.CardNo"="/Bskt/Tender/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender/Amt/text()",
"column.xpath.Program"="/Bskt/Tender/Program/text()",
"column.xpath.Earn"="/Bskt/Tender/Earn/text()",
"column.xpath.Burn"="/Bskt/Tender/Burn/text()"
)

STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
    OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 
    LOCATION '<hdfs file location>'
    TBLPROPERTIES (
    "xmlinput.start"="<Bskt","xmlinput.end"="</Bskt>"
);

Hive query output:

select * from BASKET_TENDER

Query result

Upvotes: 0

Views: 5178

Answers (1)

Gaurang Shah
Gaurang Shah

Reputation: 12920

GET only one Field

You can use subscript operator ([]) to parse xml. However, please make sure index starts with 1 and not 0.

In your case, I assume you want second instance of tender. Then just use following xml path.

"column.xpath.PayType"="/Bskt/Tender[2]/PayType/text()",
"column.xpath.CardNo"="/Bskt/Tender[2]/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender[2]/Amt/text()",

This will give you following values.

enter image description here

Get Both the fields

if you want to get both the fields as part of array, you need to define those field as array and do not provide any subscript operator when selecting xpath, as mentioned below

drop table temp.BASKET_TENDER;
CREATE EXTERNAL TABLE temp.BASKET_TENDER (
`DateTime` string,
`BsktNo` double,
`TillNo` int,
`PayType`  array<String>,
`Amt` array<float>,
`CardNo` array<string>,
`Program` string,
`Earn` float,
`Burn` float
)

ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (

"column.xpath.DateTime"="/Bskt/DateTime/text()",
"column.xpath.BsktNo"="/Bskt/BsktNo/text()",
"column.xpath.TillNo"="/Bskt/TillNo/text()",
"column.xpath.PayType"="/Bskt/Tender/PayType/text()",
"column.xpath.CardNo"="/Bskt/Tender/CardNo/text()",
"column.xpath.Amt"="/Bskt/Tender/Amt/text()",
"column.xpath.Program"="/Bskt/Tender/Program/text()",
"column.xpath.Earn"="/Bskt/Tender/Earn/text()",
"column.xpath.Burn"="/Bskt/Tender/Burn/text()"
)

STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
    OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' 
    LOCATION '/tmp/BASKET_TENDER'
    TBLPROPERTIES (
    "xmlinput.start"="<Bskt","xmlinput.end"="</Bskt>"
);

select * from temp.BASKET_TENDER;

The output would be as mentioned blow

enter image description here

Upvotes: 1

Related Questions