Reputation: 13
I have a problem with the following scenario using Spark, I have a DataFrame with a column contains an array with start and end value, e.g.
[1000, 1010]
would like to know how to create & compute another column contains a array that holding all the values for the given range? the result of the generated range values column will be:
+--------------+-------------+-----------------------------+
| Description| Accounts| Range|
+--------------+-------------+-----------------------------+
| Range 1| [101, 105]| [101, 102, 103, 104, 105]|
| Range 2| [200, 203]| [200, 201, 202, 203]|
+--------------+-------------+-----------------------------+
Thanks in advance
Upvotes: 1
Views: 882
Reputation: 1097
From Spark 2.4 you can use sequence function If you have this dataframe:
df.show()
+--------+
|column_1|
+--------+
| 1|
| 2|
| 3|
| 0|
+--------+
If you use the sequence function from 0 to column_1 you got this:
df.withColumn("range", sequence(lit(0), col("column_1"))).show()
+--------+------------+
|column_1| range|
+--------+------------+
| 1| [0, 1]|
| 2| [0, 1, 2]|
| 3|[0, 1, 2, 3]|
| 0| [0]|
+--------+------------+
For your case, use as params the index of the column Accounts
df.withColumn("Range", sequence(col("Accounts")(0), col("Accounts")(1))).show()
+--------------+-------------+-----------------------------+
| Description| Accounts| Range|
+--------------+-------------+-----------------------------+
| Range 1| [101, 105]| [101, 102, 103, 104, 105]|
| Range 2| [200, 203]| [200, 201, 202, 203]|
+--------------+-------------+-----------------------------+
Upvotes: 1
Reputation: 10092
You'll have to create a UDF for this.
df.show
+-----------+----------+
|Description| Accounts|
+-----------+----------+
| Range 1|[100, 105]|
| Range 2|[200, 203]|
+-----------+----------+
I have tried to cover few of the possible edge cases here. You can add more if you see anything missing.
val createRange = udf{ (xs: Seq[Int]) =>
if(xs.length == 0 ) Array[Int]()
else if (xs.length == 1) (0 to xs(0) ).toArray
else (xs(0) to xs(1) ).toArray
}
Call this UDF createRange
on your Dataframe and pass the Array Accounts
df.withColumn("Range" , createRange($"Accounts") ).show(false)
+-----------+----------+------------------------------+
|Description|Accounts |Range |
+-----------+----------+------------------------------+
|Range 1 |[100, 105]|[100, 101, 102, 103, 104, 105]|
|Range 2 |[200, 203]|[200, 201, 202, 203] |
+-----------+----------+------------------------------+
Upvotes: 2