Reputation: 997
I want to convert the tuple data in a PySpark dataframe into rows based on the two keys. Below are the raw data and the expected output.
Schema:
root
|-- key_1: string (nullable = true)
|-- key_2: string (nullable = true)
|-- prod: string (nullable = true)
Raw data:
key_1|key_2|prod
cust1|order1|(p1,p2,)
cust2|order2|(p1,p2,p3)
cust3|order3|(p1,)
Expected output:
key_1|key_2|prod
cust1|order1|p1
cust1|order1|p2
cust1|order1|
cust2|order2|p1
cust2|order2|p2
cust2|order2|p3
cust3|order3|p1
cust3|order3|
Upvotes: 0
Views: 1299
Reputation: 1005
Spark has a function called explode that lets you expand a list/array in one row into multiple rows, which suits your requirement exactly.
But according to your schema, we have to add one more step first: converting the prod string column to an array type.
Sample code to convert the type:
from pyspark.sql.functions import explode
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
def squared(s):
    # udf function: convert string '(p1,p2,p3)' to array ['p1', 'p2', 'p3']
    # Strip the parentheses and any trailing comma.
    # Not sure it is correct with your data, please double check
    items = s[1:-1].rstrip(',')
    return items.split(',')
# Register the udf
squared_udf = udf(squared, ArrayType(StringType()))
# Apply the udf to convert the prod string to a real array
df_2 = df.withColumn('prod_list', squared_udf('prod'))
# Explode prod_list into one row per element
df_2.select(df_2.key_1, df_2.key_2, explode(df_2.prod_list)).show()
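If the trailing comma is not always present in your data, a plain-Python helper can normalize both shapes before you wrap it in a udf. This is just a sketch; parse_prod is a made-up name:

```python
import re

def parse_prod(s):
    # Hypothetical helper: drop the surrounding parentheses and any
    # trailing comma, then split on ',' so both '(p1,p2,)' and
    # '(p1,p2,p3)' parse the same way.
    return re.sub(r"^\(|,?\)$", "", s).split(",")
```

You would register it exactly like squared above: `udf(parse_prod, ArrayType(StringType()))`.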
I have tested it, and the result is
+-----+------+---+
|key_1| key_2|col|
+-----+------+---+
|cust1|order1| p1|
|cust1|order1| p2|
|cust2|order2| p1|
|cust2|order2| p2|
|cust2|order2| p3|
|cust3|order3| p1|
+-----+------+---+
with this sample data:
data = [
{'key_1': 'cust1', 'key_2': 'order1', 'prod': '(p1,p2,)'},
{'key_1': 'cust2', 'key_2': 'order2', 'prod': '(p1,p2,p3,)'},
{'key_1': 'cust3', 'key_2': 'order3', 'prod': '(p1,)'},
]
Upvotes: 1