Reputation: 616
As the title suggests, I need to split out some columns that are comma-separated. I also want to perform a cartesian product of those columns for each line.
Let's say the source data looks like this:
Id | Name | Codes_A | Codes_B |
---|---|---|---|
1 | George | 1,2 | 3,4 |
2 | Mary | 5,6 | 7,8 |
I would like the outcome to look something like below, where the values in the comma-separated columns are combined into every possible combination per row.
Id | Name | Codes_A | Codes_B |
---|---|---|---|
1 | George | 1 | 3 |
1 | George | 1 | 4 |
1 | George | 2 | 3 |
1 | George | 2 | 4 |
2 | Mary | 5 | 7 |
2 | Mary | 5 | 8 |
2 | Mary | 6 | 7 |
2 | Mary | 6 | 8 |
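To make the desired semantics concrete: for each input row, we want the cartesian product of the split `Codes_A` and `Codes_B` values. A minimal pure-Python sketch of that transformation (the helper name `expand_row` is just for illustration):

```python
from itertools import product

def expand_row(row):
    """Split the comma-separated code columns and yield one row
    per combination (cartesian product) of the values."""
    id_, name, codes_a, codes_b = row
    for a, b in product(codes_a.split(','), codes_b.split(',')):
        yield (id_, name, a, b)

rows = [(1, 'George', '1,2', '3,4'), (2, 'Mary', '5,6', '7,8')]
expanded = [r for row in rows for r in expand_row(row)]
# expanded[0] == (1, 'George', '1', '3')
```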
Upvotes: 1
Views: 1014
Reputation: 44941
Here is how to get the cartesian product of the Codes_A and Codes_B values, which I'm assuming is what you actually need.
Demo setup
df = spark.createDataFrame([(1,'George','1,2','3,4'),(2,'Mary','5,6','7,8')],['Id','Name','Codes_A','Codes_B'])
df.show()
+---+------+-------+-------+
| Id| Name|Codes_A|Codes_B|
+---+------+-------+-------+
| 1|George| 1,2| 3,4|
| 2| Mary| 5,6| 7,8|
+---+------+-------+-------+
Solution
import pyspark.sql.functions as F

df_result = (df
    # split each comma-separated string into an array, then explode
    # the array into one row per element; chaining the two explodes
    # produces every Codes_A/Codes_B combination per original row
    .withColumn('Codes_A', F.explode(F.split('Codes_A', ',')))
    .withColumn('Codes_B', F.explode(F.split('Codes_B', ',')))
)
df_result.show()
+---+------+-------+-------+
| Id| Name|Codes_A|Codes_B|
+---+------+-------+-------+
| 1|George| 1| 3|
| 1|George| 1| 4|
| 1|George| 2| 3|
| 1|George| 2| 4|
| 2| Mary| 5| 7|
| 2| Mary| 5| 8|
| 2| Mary| 6| 7|
| 2| Mary| 6| 8|
+---+------+-------+-------+
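Why this works: each `explode` replaces a row with one copy per element of the split array, so applying it to both columns in sequence yields the full cross product. A rough pure-Python illustration of the two-step expansion (plain lists, not the Spark API):

```python
from itertools import chain

rows = [(1, 'George', '1,2', '3,4'), (2, 'Mary', '5,6', '7,8')]

# Step 1: "explode" Codes_A -- one output row per value in its split list.
step1 = list(chain.from_iterable(
    ((i, n, a, b) for a in ca.split(','))
    for i, n, ca, b in rows
))

# Step 2: "explode" Codes_B on the already-expanded rows.
step2 = list(chain.from_iterable(
    ((i, n, a, b) for b in cb.split(','))
    for i, n, a, cb in step1
))
# len(step2) == 8: every Codes_A value paired with every Codes_B value
```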
Upvotes: 1