Reputation: 997
I want to convert the tuple data in a PySpark dataframe into rows based on the two keys. Below are the raw data and the expected output.
Schema:
root
|-- key_1: string (nullable = true)
|-- key_2: string (nullable = true)
|-- prod: string (nullable = true)
Raw data:
key_1|key_2|prod
cust1|order1|(p1,p2,)
cust2|order2|(p1,p2,p3)
cust3|order3|(p1,)
Expected output:
key_1|key_2|prod
cust1|order1|p1
cust1|order1|p2
cust1|order1|
cust2|order2|p1
cust2|order2|p2
cust2|order2|p3
cust3|order3|p1
cust3|order3|
Upvotes: 0
Views: 1299
Reputation: 1005
Spark has a function called explode that lets you expand a list/array in one row into multiple rows, which suits your requirement exactly.
But according to your schema, we have to add one more step first: converting the prod string column to an array type.
Sample code to convert the type:
from pyspark.sql.functions import explode
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
def squared(s):
    # udf function: convert string '(p1,p2,p3)' to array ['p1', 'p2', 'p3']
    # Strip the parentheses and any trailing comma.
    # Not sure it is correct with your data, please double check
    items = s[1:-1].rstrip(',')
    return items.split(',')
# Register the udf
squared_udf = udf(squared, ArrayType(StringType()))
# Apply the udf to convert the prod string to a real array
df_2 = df.withColumn('prod_list', squared_udf('prod'))
# Explode prod_list into one row per element
df_2.select(df_2.key_1, df_2.key_2, explode(df_2.prod_list)).show()
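If the trailing comma is not always present in your data, a plain-Python helper can normalize both shapes before you wrap it in a udf. This is just a sketch; parse_prod is a made-up name:

```python
import re

def parse_prod(s):
    # Hypothetical helper: drop the surrounding parentheses and any
    # trailing comma, then split on ',' so both '(p1,p2,)' and
    # '(p1,p2,p3)' parse the same way.
    return re.sub(r"^\(|,?\)$", "", s).split(",")
```

You would register it exactly like squared above: `udf(parse_prod, ArrayType(StringType()))`.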
I have tested it, and the result is
+-----+------+---+
|key_1| key_2|col|
+-----+------+---+
|cust1|order1| p1|
|cust1|order1| p2|
|cust2|order2| p1|
|cust2|order2| p2|
|cust2|order2| p3|
|cust3|order3| p1|
+-----+------+---+
with this sample data:
data = [
{'key_1': 'cust1', 'key_2': 'order1', 'prod': '(p1,p2,)'},
{'key_1': 'cust2', 'key_2': 'order2', 'prod': '(p1,p2,p3,)'},
{'key_1': 'cust3', 'key_2': 'order3', 'prod': '(p1,)'},
]
Upvotes: 1