Reputation: 2847
I have a dataframe looks like this.
A Start End
1 1578 1581
1 1789 1790
2 1800 1802
start and end are epoch. And I want to create multiple rows by every seconds, like this
A time
1 1578
1 1579
1 1580
1 1581
1 1789
1 1790
2 1800
2 1801
2 1802
How to do it in pyspark? (Don't need to keep the order)
Thank you!
Upvotes: 2
Views: 2218
Reputation: 7585
The idea is to create a list
, encompassing the entire time span by including the intermediate seconds
. For eg; for Start = 1578
and End = 1581
, we create a list [1578,1579,1580,1581]
. To create this list, we first create a UDF
. Once this list is obtained, we explode it to get the required dataframe
.
# Creating the DataFrame
values = [(1,1578,1581),(1,1789,1790),(2,1800,1802)]
df = sqlContext.createDataFrame(values,['A','Start','End'])
df.show()
+---+-----+----+
| A|Start| End|
+---+-----+----+
| 1| 1578|1581|
| 1| 1789|1790|
| 2| 1800|1802|
+---+-----+----+
# Import requisite packages
from pyspark.sql.functions import udf, col, explode, array, struct
from pyspark.sql.types import ArrayType, StructType, StructField, IntegerType
#Creating UDFs below to create a list.
def make_list(start,end):
return list(range(start,end+1))
make_list_udf = udf(make_list,ArrayType(IntegerType()))
#Creating Lists of seconds finally.
df = df.withColumn('my_list',make_list_udf(col('Start'),col('End'))).drop('Start','End')
df.show(truncate=False)
+---+------------------------+
|A |my_list |
+---+------------------------+
|1 |[1578, 1579, 1580, 1581]|
|1 |[1789, 1790] |
|2 |[1800, 1801, 1802] |
+---+------------------------+
#Exploding the Lists
df = df.withColumn('time', explode('my_list')).drop('my_list')
df.show()
+---+----+
| A|time|
+---+----+
| 1|1578|
| 1|1579|
| 1|1580|
| 1|1581|
| 1|1789|
| 1|1790|
| 2|1800|
| 2|1801|
| 2|1802|
+---+----+
Upvotes: 4
Reputation: 116
Assuming your data are in dataframe df and you have a support dataframe s_df with the seconds, you can do:
df.alias("a").join(s_df.alias("b"), (col("a.Start") >= col("b.time)) & (col("a. End") <= col("b.time)), "inner").select(col("a.A"), col("b.time")).
It may become a problem in case of overlapping of "A"s. In that case you may want to make "A" unique to establish with epoch belongs to which
Upvotes: 0