Reputation: 13
I have a problem with the following scenario using PySpark version 2.0: I have a DataFrame with a column that contains an array with a start and an end value, e.g.
[1000, 1010]
How can I create and compute another column that holds an array of all the values in the given range? The result with the generated range-values column would be:
+--------------+-------------+-----------------------------+
| Description| Accounts| Range|
+--------------+-------------+-----------------------------+
| Range 1| [101, 105]| [101, 102, 103, 104, 105]|
| Range 2| [200, 203]| [200, 201, 202, 203]|
+--------------+-------------+-----------------------------+
Upvotes: 1
Views: 2367
Reputation: 547
You should use a UDF (see the UDF samples). Assuming your PySpark DataFrame is named df, it could be built like this:
df = spark.createDataFrame(
    [("Range 1", [101, 105]),
     ("Range 2", [200, 203])],
    ("Description", "Accounts"))
And the solution looks like this:
import pyspark.sql.functions as F
from pyspark.sql import types as T
import numpy as np

def make_range_number(arr):
    # Build the inclusive range [start, end] as a Python list
    return np.arange(arr[0], arr[1] + 1, 1).tolist()

# Declare the return type explicitly; without it the UDF defaults to
# StringType and Range would be the string form of the list, not an array
range_udf = F.udf(make_range_number, T.ArrayType(T.IntegerType()))

df = df.withColumn("Range", range_udf(F.col("Accounts")))
Have fun! :)
Upvotes: 0
Reputation: 767
Try this.
from pyspark.sql import functions as F
from pyspark.sql import types as pt

def range_value(a):
    # Inclusive range: a = [start, end]
    start = a[0]
    end = a[1] + 1
    return list(range(start, end))

df = spark.createDataFrame(
    [("Range 1", [101, 105]), ("Range 2", [200, 203])],
    ("Description", "Accounts"))

range_value_udf = F.udf(range_value, pt.ArrayType(pt.IntegerType()))
df = df.withColumn('Range', range_value_udf(F.col('Accounts')))
Output:
+-----------+----------+-------------------------+
|Description|  Accounts|                    Range|
+-----------+----------+-------------------------+
|    Range 1|[101, 105]|[101, 102, 103, 104, 105]|
|    Range 2|[200, 203]|     [200, 201, 202, 203]|
+-----------+----------+-------------------------+
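As a side note, on Spark 2.4+ the built-in sequence function generates the same inclusive range without a UDF at all (not an option on 2.0, which the question targets), sketched here against the same df:

from pyspark.sql import functions as F

# Spark 2.4+ only: sequence(start, stop) builds an inclusive integer range
df = df.withColumn(
    "Range",
    F.sequence(F.col("Accounts").getItem(0), F.col("Accounts").getItem(1)))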
Upvotes: 2