HamidOvic
HamidOvic

Reputation: 133

Spark SQL Dataset: Split Multiple Array Columns to individual rows

I'm new to Spark SQL and the Dataset / Dataframe API.

I have columns of which 2 columns both have multi values / arrays in my Dataset.

I want to step through the arrays per line positionally, and output a new row for each set of corresponding positional entries in the arrays. You can see how from the 2 diagrams below.

For example:

Input dataframe / dataset

+---+---------+-----+
| id|       le|leloc|
+---+---------+-----+
|  1|[aaa,bbb]|[1,2]|
|  2|[ccc,ddd]|[3,4]|
+---+---------+-----+

Expected Output dataset

I need output as per below, the data is transformed from columns to rows:

+---+---------+-----+
| id|       le|leloc|
+---+---------+-----+
|  1|aaa      |1    |
|  1|bbb      |2    |
|  2|ccc      |3    |
|  2|ddd      |4    |
+---+---------+-----+

Upvotes: 1

Views: 246

Answers (1)

Ged
Ged

Reputation: 18013

%python
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Gen some data
df1 = spark.createDataFrame([   ( 1, list(['A','B','X']), list(['1', '2', '8'])   ) for x in range(2)], ['value1', 'array1', 'array2'] )    
df2 = spark.createDataFrame([   ( 2, list(['C','D','Y']), list(['3', '4', '9'])   ) for x in range(2)], ['value1', 'array1', 'array2'] )  
df = df1.union(df2).distinct()

# from here specifically for you

col_temp_expr = "transform(array1, (x, i) -> concat(x, ',', array2[i]))"

dfA = df.withColumn("col_temp", expr(col_temp_expr))
dfB = dfA.select("value1", "array2", explode((col("col_temp")))) # Not an array
dfC = dfB.withColumn('tempArray', split(dfB['col'], ',')) # Now an array
dfC.select("value1", dfC.tempArray[0], dfC.tempArray[1]).show()   

returns:

+------+------------+------------+
|value1|tempArray[0]|tempArray[1]|
+------+------------+------------+
|     1|           A|           1|
|     1|           B|           2|
|     1|           X|           8|
|     2|           C|           3|
|     2|           D|           4|
|     2|           Y|           9|
+------+------------+------------+

You can rename cols. This had more elements per array.

Upvotes: 0

Related Questions