amellam

Reputation: 13

How to create a column with all the values in a range given by another column in PySpark

I have a problem with the following scenario using PySpark version 2.0: I have a DataFrame with a column that contains an array holding the start and end values of a range, e.g. [1000, 1010].

I would like to know how to create and compute another column containing an array that holds all the values in that range. The generated range column should look like this:

    +--------------+-------------+-----------------------------+
    |   Description|     Accounts|                        Range|
    +--------------+-------------+-----------------------------+
    |       Range 1|   [101, 105]|    [101, 102, 103, 104, 105]|
    |       Range 2|   [200, 203]|         [200, 201, 202, 203]|
    +--------------+-------------+-----------------------------+
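The core of this transformation is ordinary Python: expand an inclusive `[start, end]` pair with `range(start, end + 1)`. A minimal sketch of that logic on the sample rows, outside Spark (the helper name `expand` is just for illustration):

```python
# Expand an inclusive [start, end] pair into the full list of values.
def expand(bounds):
    start, end = bounds
    return list(range(start, end + 1))

rows = [("Range 1", [101, 105]), ("Range 2", [200, 203])]
for description, accounts in rows:
    print(description, expand(accounts))
# Range 1 [101, 102, 103, 104, 105]
# Range 2 [200, 201, 202, 203]
```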

Upvotes: 1

Views: 2367

Answers (2)

yasi

Reputation: 547

You should use a UDF (UDF sample). Assuming your PySpark DataFrame is named df, it could be created like this:

df = spark.createDataFrame(
    [("Range 1", [101, 105]),
     ("Range 2", [200, 203])],
    ("Description", "Accounts"))

And the solution looks like this:

import pyspark.sql.functions as F
from pyspark.sql import types as T
import numpy as np

def make_range_number(arr):
    # build the inclusive range [arr[0], arr[1]]
    number_range = np.arange(arr[0], arr[1] + 1, 1).tolist()
    return number_range

# declare the return type; without it, udf defaults to StringType
# and the new column would hold a string instead of an array
range_udf = F.udf(make_range_number, T.ArrayType(T.IntegerType()))

df = df.withColumn("Range", range_udf(F.col("Accounts")))
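Since the helper is plain Python, it can be sanity-checked locally before wrapping it in the UDF (assuming numpy is installed):

```python
import numpy as np

def make_range_number(arr):
    # np.arange stops before the end value, so add 1 to make it inclusive
    return np.arange(arr[0], arr[1] + 1, 1).tolist()

print(make_range_number([101, 105]))  # [101, 102, 103, 104, 105]
```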

Have a fun time! :)

Upvotes: 0

Rahul

Reputation: 767

Try this.

Define the UDF:

def range_value(a):
    start = a[0]
    end = a[1] + 1
    return list(range(start, end))

from pyspark.sql import functions as F
from pyspark.sql import types as pt

df = spark.createDataFrame(
    [("Range 1", [101, 105]), ("Range 2", [200, 203])],
    ("Description", "Accounts"))

# name the wrapped UDF differently so it does not shadow the function
range_value_udf = F.udf(range_value, pt.ArrayType(pt.IntegerType()))
df = df.withColumn('Range', range_value_udf(F.col('Accounts')))

Output:

    +--------------+-------------+-----------------------------+
    |   Description|     Accounts|                        Range|
    +--------------+-------------+-----------------------------+
    |       Range 1|   [101, 105]|    [101, 102, 103, 104, 105]|
    |       Range 2|   [200, 203]|         [200, 201, 202, 203]|
    +--------------+-------------+-----------------------------+

Upvotes: 2
