stackq

Reputation: 491

Pivot multiple columns from row to column

I have a PySpark dataframe which looks like this:

| id   | name   | policy     | payment_name | count |
|------|--------|------------|--------------|-------|
| 2    | two    | 0          | Hybrid       | 58    |
| 2    | two    | 1          | Hybrid       | 2     |
| 5    | five   | 1          | Excl         | 13    |
| 5    | five   | 0          | Excl         | 70    |
| 5    | five   | 0          | Agen         | 811   |
| 5    | five   | 1          | Agen         | 279   |
| 5    | five   | 1          | Hybrid       | 600   |
| 5    | five   | 0          | Hybrid       | 2819  |

I would like each combination of policy and payment_name to become its own column holding the respective count (reducing to one row per id).

Output would look like this:

| id | name | no_policy_hybrid | no_policy_excl | no_policy_agen | policy_hybrid | policy_excl | policy_agen |
|----|------|------------------|----------------|----------------|---------------|-------------|-------------|
| 2  | two  | 58               | 0              | 0              | 2             | 0           | 0           |
| 5  | five | 2819             | 70             | 811            | 600           | 13          | 279         |

Where a combination does not exist, it should default to 0; e.g. id 2 has no rows with payment_name Excl, so those columns are set to 0 in the example output.
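
For reference, a minimal snippet to reproduce the input DataFrame (integer types for policy and count are assumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table above
df = spark.createDataFrame(
    [(2, "two", 0, "Hybrid", 58), (2, "two", 1, "Hybrid", 2),
     (5, "five", 1, "Excl", 13), (5, "five", 0, "Excl", 70),
     (5, "five", 0, "Agen", 811), (5, "five", 1, "Agen", 279),
     (5, "five", 1, "Hybrid", 600), (5, "five", 0, "Hybrid", 2819)],
    ["id", "name", "policy", "payment_name", "count"],
)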

Upvotes: 3

Views: 258

Answers (1)

Lasse Sviland

Reputation: 1517

To pivot the table, you first need a grouping column that combines policy and payment_name.

df = df.withColumn("groupingCol", udf("{}_{}".format)("policy", "payment_name"))

When you have that, you can group by the id and name columns and pivot the grouping column.

df.groupBy("id", "name").pivot("groupingCol").agg(F.max("count"))

That should return the correct table columns.

+---+----+------+------+--------+------+------+--------+
| id|name|0_Agen|0_Excl|0_Hybrid|1_Agen|1_Excl|1_Hybrid|
+---+----+------+------+--------+------+------+--------+
|  5|five|   811|    70|    2819|   279|    13|     600|
|  2| two|  null|  null|      58|  null|  null|       2|
+---+----+------+------+--------+------+------+--------+

To get the same column names as in your example, you can start by changing the content of the policy column to policy and no_policy, like this:

df = df.withColumn("policy", when(col("policy") == 1, "policy").otherwise("no_policy"))

This is how you would replace the missing values with 0:

df = df.na.fill(0)
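
Putting the steps together, a minimal end-to-end sketch (assuming df is the input frame from the question; payment_name is lower-cased here so the pivoted column names match your example, e.g. no_policy_hybrid):

from pyspark.sql import functions as F

result = (
    df
    # relabel the 0/1 flag so the pivoted columns read no_policy_*/policy_*
    .withColumn("policy", F.when(F.col("policy") == 1, "policy").otherwise("no_policy"))
    # combine the two key columns into a single grouping column
    .withColumn("groupingCol", F.concat_ws("_", "policy", F.lower(F.col("payment_name"))))
    .groupBy("id", "name")
    .pivot("groupingCol")
    .agg(F.max("count"))
    # missing combinations become null after the pivot; default them to 0
    .na.fill(0)
)
result.show()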

Upvotes: 1
