amellam

Reputation: 13

How to create a column with all the values in a range given by another column in PySpark

I have a problem with the following scenario using PySpark version 2.0: I have a DataFrame with a column that contains an array holding the start and end values of a range, e.g. [1000, 1010].

I would like to know how to create and compute another column containing an array that holds all the values in that range. The generated range column should look like this:

    +--------------+-------------+-----------------------------+
    |   Description|     Accounts|                        Range|
    +--------------+-------------+-----------------------------+
    |       Range 1|   [101, 105]|    [101, 102, 103, 104, 105]|
    |       Range 2|   [200, 203]|         [200, 201, 202, 203]|
    +--------------+-------------+-----------------------------+
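The core of this transformation is ordinary Python: expand an inclusive `[start, end]` pair with `range(start, end + 1)`. A minimal sketch of that logic on the sample rows, outside Spark (the helper name `expand` is just for illustration):

```python
# Expand an inclusive [start, end] pair into the full list of values.
def expand(bounds):
    start, end = bounds
    return list(range(start, end + 1))

rows = [("Range 1", [101, 105]), ("Range 2", [200, 203])]
for description, accounts in rows:
    print(description, expand(accounts))
# Range 1 [101, 102, 103, 104, 105]
# Range 2 [200, 201, 202, 203]
```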

Upvotes: 1

Views: 2367

Answers (2)

yasi

Reputation: 547

You should use a UDF (UDF sample). Assuming your PySpark DataFrame is named df, it could be created like this:

df = spark.createDataFrame(
    [("Range 1", [101, 105]),
     ("Range 2", [200, 203])],
    ("Description", "Accounts"))

And the solution looks like this:

import pyspark.sql.functions as F
from pyspark.sql import types as T
import numpy as np

def make_range_number(arr):
    # build the inclusive range [arr[0], arr[1]]
    number_range = np.arange(arr[0], arr[1] + 1, 1).tolist()
    return number_range

# declare the return type; without it, udf defaults to StringType
# and the new column would hold a string instead of an array
range_udf = F.udf(make_range_number, T.ArrayType(T.IntegerType()))

df = df.withColumn("Range", range_udf(F.col("Accounts")))
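Since the helper is plain Python, it can be sanity-checked locally before wrapping it in the UDF (assuming numpy is installed):

```python
import numpy as np

def make_range_number(arr):
    # np.arange stops before the end value, so add 1 to make it inclusive
    return np.arange(arr[0], arr[1] + 1, 1).tolist()

print(make_range_number([101, 105]))  # [101, 102, 103, 104, 105]
```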

Have a fun time! :)

Upvotes: 0

Rahul

Reputation: 767

Try this.

Define the UDF:

def range_value(a):
    start = a[0]
    end = a[1] + 1
    return list(range(start, end))

from pyspark.sql import functions as F
from pyspark.sql import types as pt

df = spark.createDataFrame(
    [("Range 1", [101, 105]), ("Range 2", [200, 203])],
    ("Description", "Accounts"))

# name the wrapped UDF differently so it does not shadow the function
range_value_udf = F.udf(range_value, pt.ArrayType(pt.IntegerType()))
df = df.withColumn('Range', range_value_udf(F.col('Accounts')))

Output:

    +--------------+-------------+-----------------------------+
    |   Description|     Accounts|                        Range|
    +--------------+-------------+-----------------------------+
    |       Range 1|   [101, 105]|    [101, 102, 103, 104, 105]|
    |       Range 2|   [200, 203]|         [200, 201, 202, 203]|
    +--------------+-------------+-----------------------------+

Upvotes: 2
