Reputation: 133
I'm new to Spark SQL and the Dataset / DataFrame API.
My Dataset has two columns that each hold multi-value / array data.
I want to step through the arrays in each row positionally, and output a new row for each pair of corresponding positional entries. You can see what I mean from the two tables below.
For example:
Input dataframe / dataset
+---+---------+-----+
| id| le|leloc|
+---+---------+-----+
| 1|[aaa,bbb]|[1,2]|
| 2|[ccc,ddd]|[3,4]|
+---+---------+-----+
Expected Output dataset
I need the output below, where the array elements are pivoted from columns into rows:
+---+---------+-----+
| id| le|leloc|
+---+---------+-----+
| 1|aaa |1 |
| 1|bbb |2 |
| 2|ccc |3 |
| 2|ddd |4 |
+---+---------+-----+
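In plain Python terms (just a sketch, not Spark code), the pairing I'm after is a per-row zip of the two arrays, emitting one output row per pair:

```python
# Sketch of the desired transformation: zip the two arrays of each row
# positionally, and produce one output row per (le, leloc) pair.
rows = [(1, ["aaa", "bbb"], [1, 2]), (2, ["ccc", "ddd"], [3, 4])]
exploded = [(rid, le, loc)
            for rid, les, locs in rows
            for le, loc in zip(les, locs)]
print(exploded)  # [(1, 'aaa', 1), (1, 'bbb', 2), (2, 'ccc', 3), (2, 'ddd', 4)]
```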
Upvotes: 1
Views: 246
Reputation: 18013
%python
from pyspark.sql.functions import expr, col, explode, split

# Generate some sample data
df1 = spark.createDataFrame([(1, ['A', 'B', 'X'], ['1', '2', '8']) for x in range(2)], ['value1', 'array1', 'array2'])
df2 = spark.createDataFrame([(2, ['C', 'D', 'Y'], ['3', '4', '9']) for x in range(2)], ['value1', 'array1', 'array2'])
df = df1.union(df2).distinct()

# From here on, specifically for you: transform (Spark 2.4+) pairs each
# element x of array1 with the element of array2 at the same index i.
col_temp_expr = "transform(array1, (x, i) -> concat(x, ',', array2[i]))"
dfA = df.withColumn("col_temp", expr(col_temp_expr))
dfB = dfA.select("value1", "array2", explode(col("col_temp")))  # exploded column is a string named 'col', not an array
dfC = dfB.withColumn('tempArray', split(dfB['col'], ','))       # split it back into an array
dfC.select("value1", dfC.tempArray[0], dfC.tempArray[1]).show()
returns:
+------+------------+------------+
|value1|tempArray[0]|tempArray[1]|
+------+------------+------------+
| 1| A| 1|
| 1| B| 2|
| 1| X| 8|
| 2| C| 3|
| 2| D| 4|
| 2| Y| 9|
+------+------------+------------+
You can rename the columns with alias. This example uses more elements per array to show it generalizes.
Upvotes: 0