Reputation: 47
Input:
ID | Question-How are you-response | Question-How are you-date | Question-How are you-expected response | ...... | Question-XXX-response | Question-XXX-date | Question-XXX-expected response |
---|---|---|---|---|---|---|---|
123 | value a | 17-Aug-22 | value b | ..... | value c | 15-Aug-22 | value d |
Expected output:
ID | Question | response_given | date | Expected_Response |
---|---|---|---|---|
123 | How are you | value a | 17-Aug-22 | value b |
.. | .. | .. | .. | .. |
123 | XXX | value c | 15-Aug-22 | value d |
Upvotes: 0
Views: 989
Reputation: 6644
If your column names are consistent, i.e. they are always delimited by hyphens and the casing/spelling is the same throughout, we can unpivot the columns to rows and extract info from the column names. That info can then be pivoted to get the desired result.
import pyspark.sql.functions as func

# input data - maintain consistency in column names
# (the sample row is reconstructed from the dataframe shown below)
data_ls = [(123, 'value a', '17-Aug-22', 'value b', 'value c', '15-Aug-22', 'value d')]

data_sdf = spark.sparkContext.parallelize(data_ls). \
    toDF(['id',
          'Question-How are you-response', 'Question-How are you-date', 'Question-How are you-expected response',
          'Question-XXX-response', 'Question-XXX-date', 'Question-XXX-expected response'
          ])
# +---+-----------------------------+-------------------------+--------------------------------------+---------------------+-----------------+------------------------------+
# | id|Question-How are you-response|Question-How are you-date|Question-How are you-expected response|Question-XXX-response|Question-XXX-date|Question-XXX-expected response|
# +---+-----------------------------+-------------------------+--------------------------------------+---------------------+-----------------+------------------------------+
# |123| value a| 17-Aug-22| value b| value c| 15-Aug-22| value d|
# +---+-----------------------------+-------------------------+--------------------------------------+---------------------+-----------------+------------------------------+
Use this consistency to extract the info after unpivoting: split each column name on the hyphen to create an array, and pick the question and field out of it.
# collect each non-id column as a (name, value) struct, explode with inline(),
# and split the column name to recover the question and the field
unpivot_data_sdf = data_sdf. \
    withColumn('allcol_struct_arr',
               func.array(*[func.struct(func.lit(k).alias('key'), func.col(k).alias('val'))
                            for k in data_sdf.columns if k != 'id'])
               ). \
    selectExpr('id', 'inline(allcol_struct_arr)'). \
    withColumn('key_split', func.split('key', '-')). \
    withColumn('question', func.col('key_split')[1]). \
    withColumn('question_field', func.col('key_split')[2])
# +---+--------------------------------------+---------+------------------------------------------+-----------+-----------------+
# |id |key |val |key_split |question |question_field |
# +---+--------------------------------------+---------+------------------------------------------+-----------+-----------------+
# |123|Question-How are you-response |value a |[Question, How are you, response] |How are you|response |
# |123|Question-How are you-date |17-Aug-22|[Question, How are you, date] |How are you|date |
# |123|Question-How are you-expected response|value b |[Question, How are you, expected response]|How are you|expected response|
# |123|Question-XXX-response |value c |[Question, XXX, response] |XXX |response |
# |123|Question-XXX-date |15-Aug-22|[Question, XXX, date] |XXX |date |
# |123|Question-XXX-expected response |value d |[Question, XXX, expected response] |XXX |expected response|
# +---+--------------------------------------+---------+------------------------------------------+-----------+-----------------+
All that's left to do is to pivot the question_field column.
unpivot_data_sdf. \
    groupBy('id', 'question'). \
    pivot('question_field'). \
    agg(func.first('val')). \
    show(truncate=False)
# +---+-----------+---------+-----------------+--------+
# |id |question |date |expected response|response|
# +---+-----------+---------+-----------------+--------+
# |123|XXX |15-Aug-22|value d |value c |
# |123|How are you|17-Aug-22|value b |value a |
# +---+-----------+---------+-----------------+--------+
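If the column names and their order should match the expected output exactly, a final select with aliases will do it. A minimal sketch on top of the pivot above; the pivoted column names date, expected response and response are taken from the output shown:
result_sdf = unpivot_data_sdf. \
    groupBy('id', 'question'). \
    pivot('question_field'). \
    agg(func.first('val')). \
    select(func.col('id').alias('ID'),
           func.col('question').alias('Question'),
           func.col('response').alias('response_given'),
           'date',
           func.col('expected response').alias('Expected_Response'))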
Upvotes: 1
Reputation: 2033
You can use a loop and union:
# given that there are n = 100 questions
n = 100

for idx in range(1, n + 1):
    new_df = df.select(
        'id',
        func.lit(idx).alias('question'),
        func.col(f'question_{idx}_response').alias('response_given'),
        func.col(f'question_{idx}_date').alias('date'),
        func.col(f'question_{idx}_expected_response').alias('expected_response')
    )
    if idx == 1:
        output = new_df
    else:
        output = output.unionAll(new_df)
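As a side note, the if/else accumulation can be written more compactly with functools.reduce. A minimal sketch under the same assumptions, i.e. df exists with question_{idx}_* columns and pyspark.sql.functions is imported as func:
from functools import reduce

# build one small dataframe per question, then union them all
dfs = [df.select('id',
                 func.lit(idx).alias('question'),
                 func.col(f'question_{idx}_response').alias('response_given'),
                 func.col(f'question_{idx}_date').alias('date'),
                 func.col(f'question_{idx}_expected_response').alias('expected_response'))
       for idx in range(1, n + 1)]
output = reduce(lambda a, b: a.unionAll(b), dfs)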
Edit: 2022-08-18
Although this is not really a Spark-specific question: based on your edited question you know the format of the column names, so you can extract the question names first:
# the question name is the middle part of each hyphen-delimited column name
column_name_list = list(set([col.split('-')[1] for col in df.columns[1:]]))

for col in column_name_list:
    # same select-and-union as before, parameterized by the question name
    ...
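For completeness, a minimal sketch of how the earlier loop could adapt to the hyphen-delimited names. The suffixes response, date and expected response are assumptions read off the column names shown in the question:
# distinct question names from the hyphen-delimited columns
question_names = list(set(c.split('-')[1] for c in df.columns[1:]))

for idx, q in enumerate(question_names):
    new_df = df.select(
        'id',
        func.lit(q).alias('question'),
        func.col(f'Question-{q}-response').alias('response_given'),
        func.col(f'Question-{q}-date').alias('date'),
        func.col(f'Question-{q}-expected response').alias('expected_response')
    )
    output = new_df if idx == 0 else output.unionAll(new_df)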
Upvotes: 1