Reputation: 2526
I have following Data:
+-----+-----+----+
|Col1 |t0 |t1 |
+-----+-----+----+
| A |null |20 |
| A |20 |40 |
| B |null |10 |
| B |10 |20 |
| B |20 |120 |
| B |120 |140 |
| B |140 |320 |
| B |320 |340 |
| B |340 |360 |
+-----+-----+----+
And what I want is something like this:
+-----+-----+----+----+
|Col1 |t0 |t1 |grp |
+-----+-----+----+----+
| A |null |20 |1A |
| A |20 |40 |1A |
| B |null |10 |1B |
| B |10 |20 |1B |
| B |20 |120 |2B |
| B |120 |140 |2B |
| B |140 |320 |3B |
| B |320 |340 |3B |
| B |340 |360 |3B |
+-----+-----+----+----+
Explanation: The extra column is based on the Col1 and the difference between t1 and t0. When the difference between that two is too high => a new number is generated. (in the dataset above when the difference is greater than 50)
I build t0 with:
val windowSpec = Window.partitionBy($"Col1").orderBy("t1")
df = df.withColumn("t0", lag("t1", 1) over windowSpec)
Can someone help me how to do it? I searched but didn't get a good idea. I'm a little bit lost because I need the value of the previous calculated row of grp...
Thanks
Upvotes: 1
Views: 1149
Reputation: 41987
You can use udf
function to generate the grp
column
def testUdf = udf((col1: String, t0: Int, t1: Int)=> (t1-t0) match {
case x : Int if(x > 50) => 2+col1
case _ => 1+col1
})
Call the udf
function as
df.withColumn("grp", testUdf($"Col1", $"t0", $"t1"))
The udf
function above won't work properly due to null
values in t0
which can be replaced by 0
df.na.fill(0)
I hope this is the answer you are searching for.
Edited
Here's the complete solution using udaf . The process is complex . You've already got easy answer but it might help somebody who might use it
First defining udaf
class Boendal extends UserDefinedAggregateFunction {
def inputSchema = new StructType().add("Col1", StringType).add("t0", IntegerType).add("t1", IntegerType).add("rank", IntegerType)
def bufferSchema = new StructType().add("buff", StringType).add("buffer1", IntegerType)
def dataType = StringType
def deterministic = true
def initialize(buffer: MutableAggregationBuffer) = {
buffer.update(0, "")
buffer.update(1, 0)
}
def update(buffer: MutableAggregationBuffer, input: Row) = {
if (!input.isNullAt(0)) {
val buff = buffer.getString(0)
val col1 = input.getString(0)
val t0 = input.getInt(1)
val t1 = input.getInt(2)
val rank = input.getInt(3)
var value = 1
if((t1-t0) < 50)
value = 1
else
value = (t1-t0)/50
val lastValue = buffer(1).asInstanceOf[Integer]
// if(!buff.isEmpty) {
if (value < lastValue)
value = lastValue
// }
buffer.update(1, value)
var finalString = ""
if(buff.isEmpty){
finalString = rank+";"+value+col1
}
else
finalString = buff+"::"+rank+";"+value+col1
buffer.update(0, finalString)
}
}
def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
val buff1 = buffer1.getString(0)
val buff2 = buffer2.getString(0)
buffer1.update(0, buff1+buff2)
}
def evaluate(buffer: Row) : String = {
buffer.getString(0)
}
}
Then some udfs
def rankUdf = udf((grp: String)=> grp.split(";")(0))
def removeRankUdf = udf((grp: String) => grp.split(";")(1))
And finally call the udaf and udfs
val windowSpec = Window.partitionBy($"Col1").orderBy($"t1")
df = df.withColumn("t0", lag("t1", 1) over windowSpec)
.withColumn("rank", rank() over windowSpec)
df = df.na.fill(0)
val boendal = new Boendal
val df2 = df.groupBy("Col1").agg(boendal($"Col1", $"t0", $"t1", $"rank").as("grp2")).withColumnRenamed("Col1", "Col2")
.withColumn("grp2", explode(split($"grp2", "::")))
.withColumn("rank2", rankUdf($"grp2"))
.withColumn("grp2", removeRankUdf($"grp2"))
df = df.join(df2, df("Col1") === df2("Col2") && df("rank") === df2("rank2"))
.drop("Col2", "rank", "rank2")
df.show(false)
Hope it helps
Upvotes: 1
Reputation: 2526
I solved it myself
val grp = (coalesce(
($"t" - lag($"t", 1).over(windowSpec)),
lit(0)
) > 50).cast("bigint")
df = df.withColumn("grp", sum(grp).over(windowSpec))
With this I don't need both colums (t0 and t1) anymore but can use only t1 (or t) without compute t0.
(I only need to add the value of Col1 but the most important part the number is done and works fine.)
I got the solution from: Spark SQL window function with complex condition
thanks for your help
Upvotes: 3