yonabout
yonabout

Reputation: 67

Using Power Query to transform multiple columns of question and answer data into a model for power bi

I've got a csv extract of data from a system in this kind of format (but with 16 question, answer and score groups - not just 4):

ClientID MemberID DateCompleted Overall Score Category1 Summary Category2 Summary Category3 Summary Question1 Answer1 Question2 Answer2 Score 2 Question3 Answer3 Score 3 Question4 Answer4 Score 4
Client1 [email protected] 01/01/2023 12:00:00 100 100 100 100 How Old Are You? 25 What's your favourite colour? Blue 5 How long have you owned your house? 4 Years 4 Can you Sail? Yes 1
Client1 [email protected] 01/02/2023 12:00:00 36 61 20 27 How Old Are You? 30 What's your favourite colour? Green 4 How long have you owned your house? 10 Years 6 Can you Sail? No 0
Client2 [email protected] 01/01/2023 12:00:00 63 83 60 47 How Old Are You? 50 What's your favourite colour? Yellow 3 How long have you owned your house? Over 20 Years 10 Can you Sail? Yes 1
Client2 [email protected] 01/02/2023 13:00:00 62 80 60 47 How Old Are You? 50 What's your favourite colour? Yellow 3 How long have you owned your house? Over 20 Years 10 Can you Sail? Yes 1

It's questionnaire data and I haven't got any say over the format. One quirk is that MemberID is the unique identifier for a person, and a person can complete the questionnaire multiple times (so MemberID and DateCompleted make a row unique). That also means that a unique member could have a different age recorded on different rows if they've done the questionnaire years apart. Question 1 will always be 'how old are you', Question 2 will always be 'What's your favourite colour' etc.

So. I've been asked to get it into power bi and report on it, including Age ranges (e.g. less than 30 / 30-49 / over 50) I've got a solution using power query that's got it into a reportable model (outlined below), but it feels a bit clunky and I'm sure there's a neater / more efficient way of doing it.

I want to transform it into this model: Power Bi Model

with the tables looking like this:

Questions:

QuestionID Question
1 How old are you?
2 What's your favourite colour?
3 How long have you owned your house?
4 Can you Sail?

Member:

ClientID MemberID Age AgeRange1 AgeRangeOrder1 AgeRange2 AgeRangeOrder2
Client1 [email protected] 25 Under 30 1 20-30 1
Client1 [email protected] 30 30-50 2 30-40 2
Client2 [email protected] 50 over 50 3 50-60 4
Client2 [email protected] 50 over 50 3 50-60 4

AnswerHistory:

MemberID MemberDatekey DateCompleted QuestionID Answer Score
[email protected] [email protected]/01/23 12:00:00 01/01/2023 12:00 1 25 0
[email protected] [email protected]/01/23 12:00:00 01/01/2023 12:00 2 Blue 5
[email protected] [email protected]/01/23 12:00:00 01/01/2023 12:00 3 4 Years 4
[email protected] [email protected]/01/23 12:00:00 01/01/2023 12:00 4 Yes 1
[email protected] [email protected]/02/23 12:00:00 01/02/2023 12:00 1 30 0
[email protected] [email protected]/02/23 12:00:00 01/02/2023 12:00 2 Green 4
[email protected] [email protected]/02/23 12:00:00 01/02/2023 12:00 3 10 Years 6
[email protected] [email protected]/02/23 12:00:00 01/02/2023 12:00 4 No 0
[email protected] [email protected]/01/23 12:00:00 01/01/2023 12:00 1 50 0
[email protected] [email protected]/01/23 12:00:00 01/01/2023 12:00 2 Yellow 3
[email protected] [email protected]/01/23 12:00:00 01/01/2023 12:00 3 Over 20 Years 10
[email protected] [email protected]/01/23 12:00:00 1/01/2023 12:00 4 No 1
[email protected] [email protected]/02/23 13:00:00 01/02/2023 13:00 1 50 0
[email protected] [email protected]/02/23 13:00:00 01/02/2023 13:00 2 Yellow 3
[email protected] [email protected]/02/23 13:00:00 01/02/2023 13:00 3 Over 20 Years 10
[email protected] [email protected]/02/23 13:00:00 01/02/2023 13:00 4 Yes 1

MemberSummaries:

memberid memberdatekey datecompleted overall score Category1 Summary Category2 Summary Category3 Summary
[email protected] [email protected]/01/23 12:00:00 01/01/2023 100 100 100 100
[email protected] [email protected]/02/23 12:00:00 01/02/2023 36 61 20 27
[email protected] [email protected]/01/23 12:00:00 01/01/2023 63 83 60 47
[email protected] [email protected]/02/23 13:00:00 01/02/2023 62 80 60 47

And the steps I took to get there in power query are:

  1. Create a new 'questions' table with a 'QuestionID' and a Question. Hardcoded this in from the questions but would like to make this dynamic.

  2. Duplicate the Main query, remove all the columns except memberid, date and Question1 / Answer 1

  3. Insert a 'QuestionID column' and fill it with "1"

  4. Repeat steps 2 & 3 for the other 15 questions so I end up with 16 queries (Q1 - Q16) with memberid, date, questionid, answer & score.

  5. Append Q1 - Q16 into a new query - "AnswerHistory"

  6. Merge memberid & date to get a unique key on AnswerHistory

  7. Duplicate the main query, remove all the columns except member identifiers, date and the the summary scores - new Query -"MemberSummaries"

  8. Merge memberID & date to get a unique key on MemberSummaries

  9. Duplicate the main query, remove everything except the member identifiers, client and age.

  10. Group to remove duplicates - new Query - "Member"

  11. make some new columns to split by various age groups **

    **I just realised I probably need the age stuff on MemberSummaries because it could change over time & if it does then I won't get unique members.

So this looks like a lot of steps

List of queries

Having loads of queries to split up the questions & re-merge them into something sensible (to me) seems like overkill.

Is there a way to pivot the original data into the "AnswerHistory" format in one query rather than in 16 separate ones?

Cheers

Upvotes: 0

Views: 919

Answers (1)

Ron Rosenfeld
Ron Rosenfeld

Reputation: 60344

Edited to include the custom function
You can generate the "AnswerHistory" table from your original table with just a single query.

Algorithm should be explained by the notes in the query

let

//read in the initial data table
    Source = Excel.CurrentWorkbook(){[Name="Table8"]}[Content],

//Remove unneeded columns
    #"Cols to Remove" = List.Select(Table.ColumnNames(Source), each _ = "Overall Score" or _ = "ClientID" or Text.EndsWith(_,"Summary")),
    #"Remove Cols" = Table.RemoveColumns(Source,#"Cols to Remove"),

//Add the memberdatekey column
//   This is not needed to develop this particular table
//and set it into the proper position
    #"Add MemberDatekey" = Table.AddColumn(#"Remove Cols", "MemberDatekey", 
        each [MemberID] & DateTime.ToText([DateCompleted],"MM/dd/yyyy" & "#(lf)" & "hh:mm"), type text),
    #"Reordered Columns" = Table.ReorderColumns(#"Add MemberDatekey", {"MemberID", "MemberDatekey"} & List.RemoveFirstN(Table.ColumnNames(#"Remove Cols"),1)),

//Unpivot columns except for the first three
    #"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Reordered Columns", {"MemberID", "MemberDatekey", "DateCompleted"}, "Attribute", "Value"),

//Split column contents so as to create unique values to pivot on
//Note renaming of the split columns in the function from the default
    #"Split Column by Character Transition" = Table.SplitColumn(#"Unpivoted Other Columns", "Attribute", 
        Splitter.SplitTextByCharacterTransition((c) => not List.Contains({"0".."9"}, c), {"0".."9"}), {"Attribute", "QuestionID"}),

//Using a custom function, Pivot the "Attribute" column
//  custom function needed because the basic Pivot with no aggregation function will return an error.
//  This can also be done using native functions, if the custom function is too slow
    #"Pivot No Aggregation" = fnPivotNoAggregation(#"Split Column by Character Transition","Attribute","Value"),

//Remove the Question text column as it is no longer needed
    #"Removed Columns" = Table.RemoveColumns(#"Pivot No Aggregation",{"Question"}),

//Set the data types for the columns
    #"Changed Type" = Table.TransformColumnTypes(#"Removed Columns",{{"MemberID", type text}, {"MemberDatekey", type text}, {"DateCompleted", type datetime}, {"QuestionID", Int64.Type}, {"Answer", type any}, {"Score ", Int64.Type}}),

//Replace the null score generated for "age" by 0 as per your spec
    #"Replaced Value" = Table.ReplaceValue(#"Changed Type",null,0,Replacer.ReplaceValue,{"Score "})
in
    #"Replaced Value"

Custom Function
Enter as Blank Query and Rename: fnPivotNoAggregation

//credit: Cam Wallace  https://www.dingbatdata.com/2018/03/08/non-aggregate-pivot-with-multiple-rows-in-powerquery/

//Rename:  fnPivotNoAggregation

(Source as table,
    ColToPivot as text,
    ColForValues as text)=> 

let
     PivotColNames = List.Buffer(List.Distinct(Table.Column(Source,ColToPivot))),
     #"Pivoted Column" = Table.Pivot(Source, PivotColNames, ColToPivot, ColForValues, each _),
 
    TableFromRecordOfLists = (rec as record, fieldnames as list) =>
    
    let
        PartialRecord = Record.SelectFields(rec,fieldnames),
        RecordToList = Record.ToList(PartialRecord),
        Table = Table.FromColumns(RecordToList,fieldnames)
    in
        Table,
 
    #"Added Custom" = Table.AddColumn(#"Pivoted Column", "Values", each TableFromRecordOfLists(_,PivotColNames)),
    #"Removed Other Columns" = Table.RemoveColumns(#"Added Custom",PivotColNames),
    #"Expanded Values" = Table.ExpandTableColumn(#"Removed Other Columns", "Values", PivotColNames)
in
    #"Expanded Values"

If you desire the clients to come out in a specific order, some sorting steps may need to be added

enter image description here

Upvotes: 0

Related Questions