Talisphere
Talisphere

Reputation: 65

How to convert key/value pairs into columns?

AMENDMENT:

I apologize, in trying to simplify my question I failed to mention that there are a bunch of static columns. In fact the majority of the XML is specific tags, and the (Dynamic) Key/Value pair are just a way for developers to extend the schema. Unfortunately, this means that I cannot pivot the entire table!

To demonstrate this, consider the following input:

<?xml version="1.0"?>
<list>
    <value type="object">
        <customer>186748</customer>
        <create_date>2013-01-22</create_date>
        <tag0 name="Last_Name" type="object">
            <value>Smith</value>
        </tag0>
        <tag1 name="First_Name" type="object">
            <value>Wendy</value>
        </tag1>
        <tag2 name="Company" type="object">
            <value>ACME Inc</value>
        </tag2>
   </value>
    <value type="object">
        <customer>256238</customer>
        <create_date>2013-01-22</create_date>
        <tag0 name="First_Name" type="object">
            <value>Bob</value>
        </tag0>
        <tag1 name="Company" type="object">
            <value>ABC Corp</value>
        </tag1>
   </value>
    <value type="object">
        <customer>301654</customer>
        <create_date>2013-01-22</create_date>
        <tag0 name="Company" type="object">
            <value>Everything Co</value>
        </tag0>
   </value>
</list>

Should produce the following output:

enter image description here

I am looking for an approach to solving this type of problem. I am now wondering if I can just pivot part of a table using a custom function.

ORIGINAL POST:

I am looking for a suggested approach in PowerQuery to create columns from dynamic Key/Value pairs. My data source is an XML document with a bunch of generic elements (e.g. tag0, tag1, ..., tagN) that holds a Key (in "name" attribute) and a Value (in "value" element), for example:

<?xml version="1.0"?>
<list>
    <value type="object">
        <tag0 name="Last_Name" type="object">
            <value>Smith</value>
        </tag0>
        <tag1 name="First_Name" type="object">
            <value>Wendy</value>
        </tag1>
        <tag2 name="Company" type="object">
            <value>ACME Inc</value>
        </tag2>
   </value>
    <value type="object">
        <tag0 name="First_Name" type="object">
            <value>Bob</value>
        </tag0>
        <tag1 name="Company" type="object">
            <value>ABC Corp</value>
        </tag1>
   </value>
    <value type="object">
        <tag0 name="Company" type="object">
            <value>Everything Co</value>
        </tag0>
   </value>
</list>

What I want to do is use the Key to create a table column, and assign it the Value (or null, if it doesn't exist). Ideally, the output would look like:

enter image description here

When I load this into PowerQuery, it looks like the following:

enter image description here

My first thought was to expand each of the generic table columns (e.g. tag0, tag1, etc.) to hold the Key and Value, then separate the columns by Key, then merge all of the Keys across the expanded generic tags. By this I mean, start by expanding "tag0", create separate conditional columns for each Key type (e.g "Company0", "First_Name0", "Last_Name0", do the same for "tag1", "tag2" and "tag3", then merge "Company0" (from tag0), "Company1" (from tag1), ..., "Company3" (from tag3), and do the same for "First_Name" and "Last_Name". Trouble with this approach is it is a whole lot effort that does not scale well if another Key is added. Using this approach there are 4 generic tags, with (currently) 5 possible Keys, which means 2 columns for each generic tag for the expansion, plus 20 conditional columns, plus 5 columns for the merge.

The Question: Is there a better approach for converting one to many Key/Value pairs into columns?

Upvotes: 2

Views: 1380

Answers (2)

horseyride
horseyride

Reputation: 21318

What the heck. Editing answer for revised question

let Source = Xml.Tables(File.Contents("C:\temp\a.xml")),
Table0 = Source{0}[Table],
Expanded=ExpandAll(Table0),
shrunk= Table.RemoveColumns(Expanded,List.Select(Table.ColumnNames(Expanded),each Text.Contains(_,"type"))),
#"Added Index" = Table.AddIndexColumn(shrunk, "Index", 0, 1, Int64.Type),
Base=({"Index","customer","create_date"}), //names of columns to keep
repeating_groups=2,  // repeating groups of xx columns
Combo = List.Transform(List.Split(List.Difference(Table.ColumnNames(#"Added Index"),Base),repeating_groups), each Base & _),
#"Added Custom" =List.Accumulate(Combo,#table({"Column1"}, {}),(state,current)=> state & Table.Skip(Table.DemoteHeaders(Table.SelectColumns(#"Added Index", current)),1)),
#"Rename"=Table.RenameColumns(#"Added Custom",List.Zip({List.FirstN(Table.ColumnNames(#"Added Custom"),List.Count(Base)),Base}))            ,
#"Filtered Rows" = Table.SelectRows(#"Rename", each Record.Field(_,List.Last(Table.ColumnNames(#"Rename"))) <> null),
#"Pivoted Column" = Table.Pivot(#"Filtered Rows", List.Distinct(#"Filtered Rows"[Column5]), "Column5", "Column4")
in #"Pivoted Column"


// fn = ExpandAll
(TableToExpand as table, optional ColumnNumber as number) =>
//https://blog.crossjoin.co.uk/2014/05/21/expanding-all-columns-in-a-table-in-power-query/
let ActualColumnNumber = if (ColumnNumber=null) then 0 else ColumnNumber,
ColumnName = Table.ColumnNames(TableToExpand){ActualColumnNumber},
ColumnContents = Table.Column(TableToExpand, ColumnName),
ColumnsToExpand = List.Distinct(List.Combine(List.Transform(ColumnContents, each if _ is table then Table.ColumnNames(_) else {}))),
NewColumnNames = List.Transform(ColumnsToExpand, each ColumnName & "." & _),
CanExpandCurrentColumn = List.Count(ColumnsToExpand)>0,
ExpandedTable = if CanExpandCurrentColumn then Table.ExpandTableColumn(TableToExpand, ColumnName, ColumnsToExpand, NewColumnNames) else TableToExpand,
NextColumnNumber = if CanExpandCurrentColumn then ActualColumnNumber else ActualColumnNumber+1,
OutputTable = if NextColumnNumber>(Table.ColumnCount(ExpandedTable)-1) then ExpandedTable else ExpandAll(ExpandedTable, NextColumnNumber)
in OutputTable

Upvotes: 2

Ron Rosenfeld
Ron Rosenfeld

Reputation: 60224

Edited to account for amended sample

The below seems to work with your sample data.

Read the code comments and explore the Applied Steps to understand the algorithm.

The approach is to expand and then extract the Values into one column and the corresponding attributes into another column.

By then Pivoting (with no aggregation) on the Attributes column, PQ will create a new column for each attribute.

No custom function required in this version

Main M Code

let

//change next line to reflect actual data source
    Source = Xml.Document(File.Contents("C:\Users\ron\Desktop\Sample XML.xml")),

//Expand as needed
    #"Removed Columns" = Table.RemoveColumns(Source,{"Name", "Namespace", "Attributes"}),
    #"Expanded Value" = Table.ExpandTableColumn(#"Removed Columns", "Value", {"Value"}, {"Value.1"}),
    #"Expanded Value.1" = Table.ExpandTableColumn(#"Expanded Value", "Value.1", {"Name", "Value", "Attributes"}, {"Name", "Value", "Attributes"}),

//Extract Attributes and Values to columns
//Extraction method depends on level of object
    #"Added Custom" = Table.AddColumn(#"Expanded Value.1", "Name.1", each if Value.Is([Value],type table) then [Value]{0}[Value] else [Value], type text),
    #"Added Custom1" = Table.AddColumn(#"Added Custom", "Attribute", each if Value.Is([Value],type table) then [Attributes]{0}[Value] else [Name]),

//remove original columns
    #"Removed Columns1" = Table.RemoveColumns(#"Added Custom1",{"Name", "Value", "Attributes"}),

//create "Grouper Column" to group each set of data
//Assumes first entry of each "group" is "customer"
    #"Added Index" = Table.AddIndexColumn(#"Removed Columns1", "Index", 0, 1, Int64.Type),
    #"Added Custom2" = Table.AddColumn(#"Added Index", "Grouper", each if [Attribute]="customer" then [Index] else null),
    #"Removed Columns2" = Table.RemoveColumns(#"Added Custom2",{"Index"}),
    #"Filled Down" = Table.FillDown(#"Removed Columns2",{"Grouper"}),

//Group by Grouper, then Pivot each sub table
    #"Grouped Rows" = Table.Group(#"Filled Down", {"Grouper"}, {
        {"Pivot", each Table.Pivot(Table.RemoveColumns(_, "Grouper"), List.Distinct([Attribute]), "Attribute", "Name.1")}}),
    #"Removed Columns3" = Table.RemoveColumns(#"Grouped Rows",{"Grouper"}),

//Expand the pivoted tables and set the correct column order and data types
    #"Expanded Pivot" = Table.ExpandTableColumn(#"Removed Columns3", "Pivot", {"customer", "create_date", "Last_Name", "First_Name", "Company"}),
    #"Reorder Columns" = Table.ReorderColumns(#"Expanded Pivot", {"customer", "create_date", "First_Name", "Last_Name",  "Company"}),
    #"Set Data Types" = Table.TransformColumnTypes(#"Reorder Columns",
    List.Zip({Table.ColumnNames(#"Reorder Columns"),{type text,type date,type text,type text,type text}}))
in
    #"Set Data Types"

Results from amended question
enter image description here

Upvotes: 3

Related Questions