Reputation: 491
I have a PySpark dataframe which looks like this:
| id | name | policy | payment_name | count |
|------|--------|------------|--------------|-------|
| 2 | two | 0 | Hybrid | 58 |
| 2 | two | 1 | Hybrid | 2 |
| 5 | five | 1 | Excl | 13 |
| 5 | five | 0 | Excl | 70 |
| 5 | five | 0 | Agen | 811 |
| 5 | five | 1 | Agen | 279 |
| 5 | five | 1 | Hybrid | 600 |
| 5 | five | 0 | Hybrid | 2819 |
I would like to make the combination of policy
and payment_name
become a column with the respective count (reducing down to one row per id
).
Output would look like this:
| id | name | no_policy_hybrid | no_policy_excl | no_policy_agen | policy_hybrid | policy_excl | policy_agen |
|----|------|------------------|----------------|----------------|---------------|-------------|-------------|
| 2 | two | 58 | 0 | 0 | 2 | 0 | 0 |
| 5 | five | 2819 | 70 | 811 | 600 | 13 | 279 |
In cases where there is no combination we can default it to 0 i.e. id
2 has no combination including payment_name
Excl so it is set 0 on the example output.
Upvotes: 3
Views: 258
Reputation: 1517
To pivot the table, you would first need a grouping column to combine the policy
and the payment_name
.
df = df.withColumn("groupingCol", udf("{}_{}".format)("policy", "payment_name"))
When you have that, you can group by the id and
name` columns and pivot the grouping column.
df.groupBy("id", "name").pivot("groupingCol").agg(F.max("count"))
That should return the correct table columns.
+---+----+------+------+--------+------+------+--------+
| id|name|0_Agen|0_Excl|0_Hybrid|1_Agen|1_Excl|1_Hybrid|
+---+----+------+------+--------+------+------+--------+
| 5|five| 811| 70| 2819| 279| 13| 600|
| 2| two| null| null| 58| null| null| 2|
+---+----+------+------+--------+------+------+--------+
To get the same column names as in your example, you can start with changing the content of the policy column to policy
and no_policy
like this:
df = df.withColumn("policy", when(col("policy") == 1, "policy").otherwise("no_policy"))
This is how you would replace the missing values with 0
:
df.na.fill(0)
Upvotes: 1