David

Reputation: 89

Pyspark: Unpivot a dataframe with multiple columns without hardcoding the titles

What's a good way of unpivoting a dataframe without having to hard-code the column names? I have 50+ columns that need to be switched to rows. I've tried the following, given this example:

Using Python 2.7

df:

+----------+-------+-------+-------+-------+-------+
|Category  |Col_A  |Col_B  |Col_C  |Col_D  |.......|
+----------+-------+-------+-------+-------+-------+
|Category1 |1      |0      |3      |0      |.......|
+----------+-------+-------+-------+-------+-------+
|Category2 |5      |0      |2      |0      |.......|
+----------+-------+-------+-------+-------+-------+
|Category3 |0      |6      |0      |4      |.......|
+----------+-------+-------+-------+-------+-------+

Code:

Determine the column (Col_X) names, skipping the first (Category) column:

cols = df.columns[1:]

Get the column count and build the `'Col_X', Col_X` pair for each column:

x = ""
cnt = 0
for c in cols:
    cnt += 1
    x += "'"+str(c)+"', " + str(c) + ", "

Remove the trailing comma and space:

x = x[:-2]

Create the full expression:

xpr = "\"stack(" + str(cnt) + ", " + str(x) + ") as (Type,Value)\""

Output:

df_output = df.select("Category",expr(xpr))

But it returns something like this:

df_output.show()

+----------+-------+-------+-------+-------+-------+
|Category  |stack(55, 'Col_A', Col_A, 'Col_B', Col_B....
+----------+-------+-------+-------+-------+-------+
|Category1 |
+----------+
|Category2 |
+----------+
|Category3 |
+----------+
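The cause becomes visible if you print the expression: the escaped double quotes are part of the string itself, so `expr()` most likely parses a quoted literal rather than a call to `stack()`. A minimal reproduction of the string-building (trimmed to two columns, no Spark needed):

```python
# Rebuild the expression exactly as in the question, with two sample columns.
cols = ["Col_A", "Col_B"]
x = ""
cnt = 0
for c in cols:
    cnt += 1
    x += "'" + str(c) + "', " + str(c) + ", "
x = x[:-2]  # drop the trailing ", "

# Note the literal \" characters wrapping the whole expression.
xpr = "\"stack(" + str(cnt) + ", " + str(x) + ") as (Type,Value)\""
print(xpr)
# "stack(2, 'Col_A', Col_A, 'Col_B', Col_B) as (Type,Value)"
# The printed quotes are inside the string, so Spark never sees a stack() call.
```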

Upvotes: 1

Views: 2122

Answers (1)

David

Reputation: 89

I managed to make it work by building the expression like this instead (no embedded quote characters around the whole string):

xpr = """stack({}, {}) as (Type,Value)""".format(cnt,x)
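Putting it together, a compact sketch of the whole approach (column names here are the sample ones from the question; in practice you'd take `df.columns[1:]`, and the final `select` assumes an existing `df` and `from pyspark.sql.functions import expr`):

```python
# Sample column names; with a real dataframe use: cols = df.columns[1:]
cols = ["Col_A", "Col_B", "Col_C", "Col_D"]

# Build the "'Col_X', Col_X" pairs in one pass.
pairs = ", ".join("'{0}', {0}".format(c) for c in cols)

# Build the stack() expression without any embedded quote characters.
xpr = "stack({}, {}) as (Type, Value)".format(len(cols), pairs)
print(xpr)
# stack(4, 'Col_A', Col_A, 'Col_B', Col_B, 'Col_C', Col_C, 'Col_D', Col_D) as (Type, Value)

# Then, on the actual dataframe:
# df_output = df.select("Category", expr(xpr))
```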

Upvotes: 1
