Reputation: 33
I have the following dataframe:
col1  col2
1     2020-02-27 15:00:00
1     2020-02-27 15:04:00
I need the output as
col1  col2                 col3
1     2020-02-27 15:00:00  null
1     2020-02-27 15:04:00  Y
col3 has to be populated with Y for the row holding the maximum timestamp value in col2, and with null for the other rows.
I have tried the below approach:
df = spark.sql("select col1,col2 from table")
max_ts = df.select(max("col2")).show()
y=(f.when(f.col('col2') == max_ts, "Y"))
df1 = df.withColumn('col3',y)
The above approach yields only null values in col3.
Kindly suggest a possible solution or point out the mistake.
TIA.
Edit: I need to group by col1 and fetch the maximum value of col2 within each group.
Upvotes: 1
Views: 986
Reputation: 6323
Perhaps this is helpful - use max(..).over(window) partitioned by col1 and compare it against col2:
df2.show(false)
df2.printSchema()
/**
* +----+-------------------+
* |col1|col2 |
* +----+-------------------+
* |1 |2020-02-27 15:00:00|
* |1 |2020-02-27 15:04:00|
* +----+-------------------+
*
* root
* |-- col1: integer (nullable = true)
* |-- col2: timestamp (nullable = true)
*/
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// flag the row that carries the maximum col2 within each col1 group
val w = Window.partitionBy("col1")
df2.withColumn("col3",
    when(max("col2").over(w).cast("long") - col("col2").cast("long") === 0, "Y")
  )
  .show(false)
/**
* +----+-------------------+----+
* |col1|col2 |col3|
* +----+-------------------+----+
* |1 |2020-02-27 15:00:00|null|
* |1 |2020-02-27 15:04:00|Y |
* +----+-------------------+----+
*/
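Since the question's code is PySpark, here is a minimal PySpark sketch of the same window approach, assuming the input DataFrame is the df from the question. Note that in the original attempt, max_ts = df.select(max("col2")).show() assigns None to max_ts because show() only prints and returns nothing, so the when condition can never be true and col3 stays null.

from pyspark.sql import functions as f
from pyspark.sql.window import Window

w = Window.partitionBy("col1")
df1 = df.withColumn(
    "col3",
    # Y for the row holding the max col2 in its col1 group, null otherwise
    f.when(f.col("col2") == f.max("col2").over(w), "Y")
)
df1.show(truncate=False)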
df2.createOrReplaceTempView("table")
spark.sql(
"""
| select col1, col2,
| case when (cast(max(col2) over (partition by col1) as long) - cast(col2 as long) = 0) then 'Y' end as col3
| from table
""".stripMargin)
.show(false)
/**
* +----+-------------------+----+
* |col1|col2 |col3|
* +----+-------------------+----+
* |1 |2020-02-27 15:00:00|null|
* |1 |2020-02-27 15:04:00|Y |
* +----+-------------------+----+
*/
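The same SQL should also run unchanged from PySpark, assuming the temp view is registered the same way:

df.createOrReplaceTempView("table")
spark.sql("""
    select col1, col2,
           case when (cast(max(col2) over (partition by col1) as long) - cast(col2 as long) = 0) then 'Y' end as col3
    from table
""").show(truncate=False)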
Upvotes: 1