Kit Fung Chai
Kit Fung Chai

Reputation: 13

How to create a column with all the values in a range given by another column

I have a problem with the following scenario using Spark, I have a DataFrame with a column contains an array with start and end value, e.g.

[1000, 1010]

would like to know how to create & compute another column contains a array that holding all the values for the given range? the result of the generated range values column will be:

    +--------------+-------------+-----------------------------+
    |   Description|     Accounts|                        Range|
    +--------------+-------------+-----------------------------+
    |       Range 1|   [101, 105]|    [101, 102, 103, 104, 105]|
    |       Range 2|   [200, 203]|         [200, 201, 202, 203]|
    +--------------+-------------+-----------------------------+

Thanks in advance

Upvotes: 1

Views: 882

Answers (2)

Luis A.G.
Luis A.G.

Reputation: 1097

From Spark 2.4 you can use sequence function If you have this dataframe:

df.show()
+--------+
|column_1|
+--------+
|       1|
|       2|
|       3|
|       0|
+--------+

If you use the sequence function from 0 to column_1 you got this:

df.withColumn("range", sequence(lit(0), col("column_1"))).show()
+--------+------------+
|column_1|       range|
+--------+------------+
|       1|      [0, 1]|
|       2|   [0, 1, 2]|
|       3|[0, 1, 2, 3]|
|       0|         [0]|
+--------+------------+

For your case, use as params the index of the column Accounts

df.withColumn("Range", sequence(col("Accounts")(0), col("Accounts")(1))).show()
 +--------------+-------------+-----------------------------+
 |   Description|     Accounts|                        Range|
 +--------------+-------------+-----------------------------+
 |       Range 1|   [101, 105]|    [101, 102, 103, 104, 105]|
 |       Range 2|   [200, 203]|         [200, 201, 202, 203]|
 +--------------+-------------+-----------------------------+

Upvotes: 1

philantrovert
philantrovert

Reputation: 10092

You'll have to create a UDF for this.

df.show
+-----------+----------+
|Description|  Accounts|
+-----------+----------+
|    Range 1|[100, 105]|
|    Range 2|[200, 203]|
+-----------+----------+

I have tried to cover few of the possible edge cases here. You can add more if you see anything missing.

val createRange = udf{ (xs: Seq[Int]) => 
    if(xs.length == 0 ) Array[Int]()
    else if (xs.length == 1) (0 to xs(0) ).toArray
    else (xs(0) to xs(1) ).toArray 
}

Call this UDF createRange on your Dataframe and pass the Array Accounts

df.withColumn("Range" , createRange($"Accounts") ).show(false)
+-----------+----------+------------------------------+
|Description|Accounts  |Range                         |
+-----------+----------+------------------------------+
|Range 1    |[100, 105]|[100, 101, 102, 103, 104, 105]|
|Range 2    |[200, 203]|[200, 201, 202, 203]          |
+-----------+----------+------------------------------+

Upvotes: 2

Related Questions