Eleanor

Reputation: 2847

Pyspark create multiple rows for a record that include a time range

I have a dataframe that looks like this.

A  Start  End
1  1578   1581
1  1789   1790
2  1800   1802

Start and End are epoch seconds, and I want to expand each record into one row per second, like this:

A  time
1  1578
1  1579
1  1580
1  1581
1  1789
1  1790
2  1800
2  1801
2  1802

How can I do this in PySpark? (The order doesn't need to be preserved.)

Thank you!

Upvotes: 2

Views: 2218

Answers (2)

cph_sto

Reputation: 7585

The idea is to create a list covering the entire time span, including the intermediate seconds. For example, for Start = 1578 and End = 1581, we create the list [1578, 1579, 1580, 1581]. We build this list with a UDF, then explode it to get the required dataframe.

# Creating the DataFrame
values = [(1,1578,1581),(1,1789,1790),(2,1800,1802)]
df = sqlContext.createDataFrame(values,['A','Start','End'])
df.show()
+---+-----+----+
|  A|Start| End|
+---+-----+----+
|  1| 1578|1581|
|  1| 1789|1790|
|  2| 1800|1802|
+---+-----+----+

# Import requisite packages
from pyspark.sql.functions import udf, col, explode, array, struct
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType

# Creating a UDF that builds the inclusive list of seconds.
def make_list(start,end):
    return list(range(start,end+1))
make_list_udf = udf(make_list,ArrayType(IntegerType()))

#Creating Lists of seconds finally.
df = df.withColumn('my_list',make_list_udf(col('Start'),col('End'))).drop('Start','End')
df.show(truncate=False)
+---+------------------------+
|A  |my_list                 |
+---+------------------------+
|1  |[1578, 1579, 1580, 1581]|
|1  |[1789, 1790]            |
|2  |[1800, 1801, 1802]      |
+---+------------------------+

#Exploding the Lists
df = df.withColumn('time', explode('my_list')).drop('my_list')
df.show()
+---+----+
|  A|time|
+---+----+
|  1|1578|
|  1|1579|
|  1|1580|
|  1|1581|
|  1|1789|
|  1|1790|
|  2|1800|
|  2|1801|
|  2|1802|
+---+----+

Upvotes: 4

Stefano Gallotti

Reputation: 116

Assuming your data are in dataframe df and you have a support dataframe s_df with the seconds, you can do:

df.alias("a").join(s_df.alias("b"), (col("b.time") >= col("a.Start")) & (col("b.time") <= col("a.End")), "inner").select(col("a.A"), col("b.time"))

This may become a problem if ranges for the same "A" overlap. In that case you may want to make "A" unique so you can establish which epoch belongs to which record.

Upvotes: 0
