Reputation: 616
As the title suggests, I need to split out some columns that are comma-separated. I also want to perform a cartesian product of those columns for each line.
Let's say the source data looks like this:
Id | Name | Codes_A | Codes_B |
---|---|---|---|
1 | George | 1,2 | 3,4 |
2 | Mary | 5,6 | 7,8 |
I would like the outcome to look something like below, where the values in the comma-separated columns are combined into every possible combination per row.
Id | Name | Codes_A | Codes_B |
---|---|---|---|
1 | George | 1 | 3 |
1 | George | 1 | 4 |
1 | George | 2 | 3 |
1 | George | 2 | 4 |
2 | Mary | 5 | 7 |
2 | Mary | 5 | 8 |
2 | Mary | 6 | 7 |
2 | Mary | 6 | 8 |
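To make the desired semantics concrete: for each input row, we want the cartesian product of the split `Codes_A` and `Codes_B` values. A minimal pure-Python sketch of that transformation (the helper name `expand_row` is just for illustration):

```python
from itertools import product

def expand_row(row):
    """Split the comma-separated code columns and yield one row
    per combination (cartesian product) of the values."""
    id_, name, codes_a, codes_b = row
    for a, b in product(codes_a.split(','), codes_b.split(',')):
        yield (id_, name, a, b)

rows = [(1, 'George', '1,2', '3,4'), (2, 'Mary', '5,6', '7,8')]
expanded = [r for row in rows for r in expand_row(row)]
# expanded[0] == (1, 'George', '1', '3')
```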
Upvotes: 1
Views: 1014
Reputation: 44941
Here is how to get the cartesian product of the Codes_A and Codes_B values, which I'm assuming is what you actually need.
Demo setup
df = spark.createDataFrame([(1,'George','1,2','3,4'),(2,'Mary','5,6','7,8')],['Id','Name','Codes_A','Codes_B'])
df.show()
+---+------+-------+-------+
| Id| Name|Codes_A|Codes_B|
+---+------+-------+-------+
| 1|George| 1,2| 3,4|
| 2| Mary| 5,6| 7,8|
+---+------+-------+-------+
Solution
import pyspark.sql.functions as F

df_result = (df
    # split each comma-separated string into an array, then explode
    # the array into one row per element; chaining the two explodes
    # produces every Codes_A/Codes_B combination per original row
    .withColumn('Codes_A', F.explode(F.split('Codes_A', ',')))
    .withColumn('Codes_B', F.explode(F.split('Codes_B', ',')))
)
df_result.show()
+---+------+-------+-------+
| Id| Name|Codes_A|Codes_B|
+---+------+-------+-------+
| 1|George| 1| 3|
| 1|George| 1| 4|
| 1|George| 2| 3|
| 1|George| 2| 4|
| 2| Mary| 5| 7|
| 2| Mary| 5| 8|
| 2| Mary| 6| 7|
| 2| Mary| 6| 8|
+---+------+-------+-------+
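Why this works: each `explode` replaces a row with one copy per element of the split array, so applying it to both columns in sequence yields the full cross product. A rough pure-Python illustration of the two-step expansion (plain lists, not the Spark API):

```python
from itertools import chain

rows = [(1, 'George', '1,2', '3,4'), (2, 'Mary', '5,6', '7,8')]

# Step 1: "explode" Codes_A -- one output row per value in its split list.
step1 = list(chain.from_iterable(
    ((i, n, a, b) for a in ca.split(','))
    for i, n, ca, b in rows
))

# Step 2: "explode" Codes_B on the already-expanded rows.
step2 = list(chain.from_iterable(
    ((i, n, a, b) for b in cb.split(','))
    for i, n, a, cb in step1
))
# len(step2) == 8: every Codes_A value paired with every Codes_B value
```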
Upvotes: 1