Reputation: 1117
I have the below Spark dataframe/dataset. Column_2 has dates in string format.
Column_1 Column_2
A 2020-08-05
B 2020-08-01
B 2020-09-20
B 2020-12-31
C 2020-05-10
My expected output dataframe should have only one row per value in Column_1. If there are multiple dates in Column_2 for the same key in Column_1, the next available date should be picked. If there is only one row, its date should be retained.
Expected Output:
Column_1 Column_2
A 2020-08-05
B 2020-09-20
C 2020-05-10
Is there a way to achieve this in Java Spark, preferably without using a UDF?
Upvotes: 0
Views: 816
Reputation: 6338
Perhaps this is helpful:
dataset.show(false);
dataset.printSchema();
/**
*+--------+----------+
* |Column_1|Column_2 |
* +--------+----------+
* |A |2020-08-05|
* |D |2020-08-01|
* |D |2020-08-02|
* |B |2020-08-01|
* |B |2020-09-20|
* |B |2020-12-31|
* |C |2020-05-10|
* +--------+----------+
*
* root
* |-- Column_1: string (nullable = true)
* |-- Column_2: string (nullable = true)
*/
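// Requires: import static org.apache.spark.sql.functions.*;
//           import org.apache.spark.sql.expressions.Window;
dataset.withColumn("Column_2", to_date(col("Column_2")))
    // number of rows per Column_1 value
    .withColumn("count", count("Column_2").over(Window.partitionBy("Column_1")))
    // multi-row groups: keep only future dates (null otherwise); single-row groups: keep the date
    .withColumn("positive", when(col("count").gt(1),
        when(col("Column_2").gt(current_date()), col("Column_2"))
    ).otherwise(col("Column_2")))
    // multi-row groups: keep only past dates (null otherwise); single-row groups: keep the date
    .withColumn("negative", when(col("count").gt(1),
        when(col("Column_2").lt(current_date()), col("Column_2"))
    ).otherwise(col("Column_2")))
    .groupBy("Column_1")
    // earliest future date and latest past date per group
    .agg(min("positive").as("positive"), max("negative").as("negative"))
    // prefer the future date; fall back to the past date when there is none
    .selectExpr("Column_1", "coalesce(positive, negative) as Column_2")
    .show(false);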
/**
* +--------+----------+
* |Column_1|Column_2 |
* +--------+----------+
* |A |2020-08-05|
* |D |2020-08-02|
* |B |2020-09-20|
* |C |2020-05-10|
* +--------+----------+
*/
Upvotes: 1
Reputation: 13581
SCALA: This will give the result.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, expr, min}
val w = Window.partitionBy("Column_1")
df.withColumn("count", count("Column_2").over(w))
.withColumn("later", expr("IF(Column_2 > date(current_timestamp), True, False)"))
.filter("count = 1 or (count != 1 and later = True)")
.groupBy("Column_1")
.agg(min("Column_2").alias("Column_2"))
.orderBy("Column_1")
.show(false)
+--------+----------+
|Column_1|Column_2 |
+--------+----------+
|A |2020-08-05|
|B |2020-09-20|
|C |2020-05-10|
+--------+----------+
One caveat: if a Column_1 value has more than one date and none of them is after current_timestamp, that value is dropped from the result.
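To make that caveat concrete, here is a minimal plain-Python sketch of the selection rule, not Spark code. The function name `pick_next_date`, the `fallback_to_latest_past` flag, the extra key `D` and the fixed reference date are assumptions made for illustration; `today` stands in for Spark's current date. With the fallback switched off, a key whose dates are all in the past disappears, which is the behaviour described above; with it switched on, the latest past date is kept instead.

```python
from collections import defaultdict
from datetime import date


def pick_next_date(rows, today, fallback_to_latest_past=True):
    """Return {key: 'YYYY-MM-DD'} following the 'next available date' rule.

    rows  -- iterable of (key, 'YYYY-MM-DD') pairs
    today -- reference date (hypothetical stand-in for Spark's current date)
    """
    groups = defaultdict(list)
    for key, day in rows:
        groups[key].append(date.fromisoformat(day))

    result = {}
    for key, days in groups.items():
        if len(days) == 1:
            # A single date is always retained.
            result[key] = days[0].isoformat()
            continue
        later = [d for d in days if d > today]
        if later:
            # Earliest date strictly after the reference date.
            result[key] = min(later).isoformat()
        elif fallback_to_latest_past:
            # No later date: keep the latest one instead of dropping the key.
            result[key] = max(days).isoformat()
    return result


rows = [
    ("A", "2020-08-05"),
    ("B", "2020-08-01"), ("B", "2020-09-20"), ("B", "2020-12-31"),
    ("C", "2020-05-10"),
    ("D", "2020-08-01"), ("D", "2020-08-02"),  # all dates before `today`
]
today = date(2020, 8, 15)  # assumed reference date

with_fallback = pick_next_date(rows, today)
without_fallback = pick_next_date(rows, today, fallback_to_latest_past=False)
print(with_fallback)
print(without_fallback)
```

In Spark terms, the fallback roughly corresponds to aggregating a second "past dates" column and coalescing it with the "future dates" column.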
Upvotes: 0
Reputation: 2011
Create the DataFrame first:
from pyspark.sql import functions as F, Window as W
df_b = spark.createDataFrame([("A","2020-08-05"),("B","2020-08-01"),("B","2020-09-20"),("B","2020-12-31"),("C","2020-05-10")],[ "col1","col2"])
# Order by the date inside each group so that row numbers follow the dates
_w = W.partitionBy("col1").orderBy("col2")
df_b = df_b.withColumn("rn", F.row_number().over(_w))
The logic here is to pick the second element of each group when the group has more than one row. To do that, first assign a row number within every group, then keep the first row of each group that has a single row and the first two rows of each group that has more than one row.
case = F.expr("""
CASE WHEN rn =1 THEN 1
WHEN rn =2 THEN 1
END""")
df_b = df_b.withColumn('case_condition', case)
df_b = df_b.filter(F.col("case_condition") == F.lit("1"))
Intermediate Output
+----+----------+---+--------------+
|col1| col2| rn|case_condition|
+----+----------+---+--------------+
| B|2020-08-01| 1| 1|
| B|2020-09-20| 2| 1|
| C|2020-05-10| 1| 1|
| A|2020-08-05| 1| 1|
+----+----------+---+--------------+
Finally, take the last (latest) element of every group. F.max is used here because F.last does not guarantee row order after a shuffle:
df = df_b.groupBy("col1").agg(F.max("col2").alias("col2")).orderBy("col1")
df.show()
+----+----------+
|col1| col2|
+----+----------+
| A|2020-08-05|
| B|2020-09-20|
| C|2020-05-10|
+----+----------+
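The row-number logic above can also be checked without a Spark session. The following is a minimal plain-Python sketch, assuming the rows of each group are meant to be ordered by date; the function name `second_or_only` is made up for illustration. Note that this rule ignores the current date, so it only matches the expected output when the second-earliest date is the one wanted.

```python
from collections import defaultdict


def second_or_only(rows):
    """For each key, return the second-earliest date, or the only date."""
    groups = defaultdict(list)
    for key, day in rows:
        groups[key].append(day)

    result = {}
    for key, days in groups.items():
        ordered = sorted(days)  # ISO 'YYYY-MM-DD' strings sort chronologically
        # Index 1 is the second row when it exists; otherwise use index 0.
        result[key] = ordered[min(1, len(ordered) - 1)]
    return result


# Same data as in the question, deliberately shuffled.
rows = [
    ("B", "2020-12-31"), ("A", "2020-08-05"), ("B", "2020-08-01"),
    ("C", "2020-05-10"), ("B", "2020-09-20"),
]
picked = second_or_only(rows)
print(picked)
```

Because the dates are sorted explicitly, the result does not depend on the order in which the rows arrive.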
Upvotes: 0