Fjurg
Fjurg

Reputation: 616

How can I split columns to their own row when comma-separated in column using PySpark?

As the title suggests, I need to split out some columns that comma-separated. I also want to perform a carthesian product of the columns for each line.

Let's say the source data looks like this:

Id Name Codes_A Codes_B
1 George 1,2 3,4
2 Mary 5,6 7,8

I would like the outcome to look something like below, where all the values in the comma-separated columns are combined to each possible outcome.

Id Name Codes_A Codes_B
1 George 1 3
1 George 1 4
1 George 2 3
1 George 2 4
2 Mary 5 7
2 Mary 5 8
2 Mary 6 7
2 Mary 6 8

Upvotes: 1

Views: 1014

Answers (1)

David דודו Markovitz
David דודו Markovitz

Reputation: 44941

Here is the cartesian product of Codes_A & Codes_B values.
I'm assuming this is what you actually need.

Demo setup

df = spark.createDataFrame([(1,'George','1,2','3,4'),(2,'Mary','5,6','7,8')],['Id','Name','Codes_A','Codes_B'])

df.show()

+---+------+-------+-------+
| Id|  Name|Codes_A|Codes_B|
+---+------+-------+-------+
|  1|George|    1,2|    3,4|
|  2|  Mary|    5,6|    7,8|
+---+------+-------+-------+

Solution

import pyspark.sql.functions as F

df_result = (df
             .withColumn('Codes_A', F.explode(F.split('Codes_A',',')))
             .withColumn('Codes_B', F.explode(F.split('Codes_B',',')))
            )

df_result.show()

+---+------+-------+-------+
| Id|  Name|Codes_A|Codes_B|
+---+------+-------+-------+
|  1|George|      1|      3|
|  1|George|      1|      4|
|  1|George|      2|      3|
|  1|George|      2|      4|
|  2|  Mary|      5|      7|
|  2|  Mary|      5|      8|
|  2|  Mary|      6|      7|
|  2|  Mary|      6|      8|
+---+------+-------+-------+

Upvotes: 1

Related Questions