pradeep nadarajan

Reputation: 47

How to dynamically pivot/transpose columns in pyspark dataframe

Input:

ID  | Question-How are you-Response | Question-How are you-date | Question How are you-expected response | ... | Question-XXX-response | Question-XXX-date | question-XXX-expected-response
123 | value a                       | 17-Aug-22                 | value b                                | ... | value c               | 15-Aug-22         | value d

Expected output:

ID  | Question    | response_given | date      | Expected_Response
123 | How are you | value a        | 17-Aug-22 | value b
... | ...         | ...            | ...       | ...
123 | XXX         | value c        | 15-Aug-22 | value d

Upvotes: 0

Views: 989

Answers (2)

samkart

Reputation: 6644

If your column names are consistent - i.e., always delimited by hyphens, with the same casing and spelling - we can unpivot the columns to rows and extract info from the column names. That info can then be pivoted to get the desired result.

from pyspark.sql import functions as func

# input data - maintain consistency in column names
data_ls = [(123, 'value a', '17-Aug-22', 'value b', 'value c', '15-Aug-22', 'value d')]

data_sdf = spark.sparkContext.parallelize(data_ls). \
    toDF(['id', 
          'Question-How are you-response', 'Question-How are you-date', 'Question-How are you-expected response',
          'Question-XXX-response', 'Question-XXX-date', 'Question-XXX-expected response'
          ])

# +---+-----------------------------+-------------------------+--------------------------------------+---------------------+-----------------+------------------------------+
# | id|Question-How are you-response|Question-How are you-date|Question-How are you-expected response|Question-XXX-response|Question-XXX-date|Question-XXX-expected response|
# +---+-----------------------------+-------------------------+--------------------------------------+---------------------+-----------------+------------------------------+
# |123|                      value a|                17-Aug-22|                               value b|              value c|        15-Aug-22|                       value d|
# +---+-----------------------------+-------------------------+--------------------------------------+---------------------+-----------------+------------------------------+

After the unpivot, use split to break each column name into an array, from which the question and the field can be extracted.

# pack each (column name, value) pair into a struct, collect the structs in an
# array, and explode the array to rows with inline(); then split the column
# name on hyphens to extract the question and the field
unpivot_data_sdf = data_sdf. \
    withColumn('allcol_struct_arr', 
               func.array(*[func.struct(func.lit(k).alias('key'), func.col(k).alias('val')) 
                            for k in data_sdf.columns if k != 'id']
                          )
               ). \
    selectExpr('id', 'inline(allcol_struct_arr)'). \
    withColumn('key_split', func.split('key', '-')). \
    withColumn('question', func.col('key_split')[1]). \
    withColumn('question_field', func.col('key_split')[2])

# +---+--------------------------------------+---------+------------------------------------------+-----------+-----------------+
# |id |key                                   |val      |key_split                                 |question   |question_field   |
# +---+--------------------------------------+---------+------------------------------------------+-----------+-----------------+
# |123|Question-How are you-response         |value a  |[Question, How are you, response]         |How are you|response         |
# |123|Question-How are you-date             |17-Aug-22|[Question, How are you, date]             |How are you|date             |
# |123|Question-How are you-expected response|value b  |[Question, How are you, expected response]|How are you|expected response|
# |123|Question-XXX-response                 |value c  |[Question, XXX, response]                 |XXX        |response         |
# |123|Question-XXX-date                     |15-Aug-22|[Question, XXX, date]                     |XXX        |date             |
# |123|Question-XXX-expected response        |value d  |[Question, XXX, expected response]        |XXX        |expected response|
# +---+--------------------------------------+---------+------------------------------------------+-----------+-----------------+

All that's left to do is pivot the question_field column:

unpivot_data_sdf. \
    groupBy('id', 'question'). \
    pivot('question_field'). \
    agg(func.first('val')). \
    show(truncate=False)

# +---+-----------+---------+-----------------+--------+
# |id |question   |date     |expected response|response|
# +---+-----------+---------+-----------------+--------+
# |123|XXX        |15-Aug-22|value d          |value c |
# |123|How are you|17-Aug-22|value b          |value a |
# +---+-----------+---------+-----------------+--------+
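
Side note: on Spark 3.4+, the array-of-structs + inline() unpivot above can be replaced with the built-in DataFrame.unpivot. A minimal sketch against the same data_sdf (the split/pivot steps that follow stay the same):

# Spark 3.4+: unpivot() does the array-of-structs + inline() work for us
unpivot_sdf = data_sdf.unpivot(
    ids='id',
    values=[c for c in data_sdf.columns if c != 'id'],
    variableColumnName='key',
    valueColumnName='val'
)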

Upvotes: 1

Jonathan

Reputation: 2033

You can use a loop and union:

from pyspark.sql import functions as func

# given that there are n = 100 questions
n = 100

for idx in range(1, n + 1):
    new_df = df.select(
        'id',
        func.lit(idx).alias('question'),
        func.col(f"question_{idx}_response").alias('response_given'),
        func.col(f"question_{idx}_date").alias('date'),
        func.col(f"question_{idx}_expected_response").alias('expected_response')
    )
    if idx == 1:
        output = new_df
    else:
        output = output.unionAll(new_df)
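
If you prefer to avoid the if/else bookkeeping, the same loop can be written with functools.reduce - a sketch, assuming the same df and question_{idx}_* column naming as above:

from functools import reduce
from pyspark.sql import DataFrame, functions as func

frames = [
    df.select(
        'id',
        func.lit(idx).alias('question'),
        func.col(f"question_{idx}_response").alias('response_given'),
        func.col(f"question_{idx}_date").alias('date'),
        func.col(f"question_{idx}_expected_response").alias('expected_response')
    )
    for idx in range(1, n + 1)
]
output = reduce(DataFrame.unionAll, frames)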

Edit: 2022-08-18

Although this is not really a Spark-specific question: based on your edited question, you know the format of the column names, so you can:

column_name_list = list(set([col.split('-')[1] for col in df.columns[1:]]))

for col in column_name_list:
    # same select/union as before, keyed by the question name
    ...
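
Filling in the "same as before" part, a minimal sketch of the adapted loop, assuming the hyphen-delimited column names from the question (Question-<question>-response, Question-<question>-date, Question-<question>-expected response):

output = None
for question in column_name_list:
    new_df = df.select(
        'id',
        func.lit(question).alias('question'),
        func.col(f"Question-{question}-response").alias('response_given'),
        func.col(f"Question-{question}-date").alias('date'),
        func.col(f"Question-{question}-expected response").alias('expected_response')
    )
    # start the union chain on the first question, then keep appending
    output = new_df if output is None else output.unionAll(new_df)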

Upvotes: 1
