Spark drop duplicates and select row with max value

Question

I'm trying to drop duplicates based on column1 and keep the row with the maximum value in column2. column2 holds years (2019, 2020, etc.) as values and is of type String. My current solution is to cast column2 to an integer and select the max value.

Dataset<Row> ds; // the dataset with column1, column2 (year), column3, etc.
Dataset<Row> newDs = ds.withColumn("column2Int", col("column2").cast(DataTypes.IntegerType));
newDs = newDs.groupBy("column1").max("column2Int"); // drops all other columns

This approach drops all the other columns of the original dataset 'ds' when I do the "group by", so I have to join 'ds' and 'newDs' to get the original columns back. Casting the String column to Integer also looks like an inefficient workaround.

Is it possible to drop the duplicates and keep the row with the largest string value directly from the original dataset?


Answers (1)
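
One possible approach, sketched here under assumptions rather than taken from the original thread, is to rank the rows within each column1 group with a window function and keep only the top-ranked row. This keeps every original column, so no join back to 'ds' is needed. It also orders by the String column directly: strings of equal width, such as four-digit years, sort lexicographically in the same order as their numeric values, so the cast can be skipped as long as every value has the same width. The class name LatestPerKey, the helper latestPerKey, the temporary column "rn", and the sample rows are all made up for illustration.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import java.util.Arrays;
import java.util.List;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.row_number;

public class LatestPerKey {

    // Hypothetical helper: for each column1 value, keep the single row
    // whose column2 is largest. All other columns are preserved.
    static Dataset<Row> latestPerKey(Dataset<Row> ds) {
        // Rank rows inside each column1 group, largest column2 first.
        WindowSpec byKey = Window.partitionBy("column1")
                                 .orderBy(col("column2").desc());
        return ds.withColumn("rn", row_number().over(byKey)) // "rn" is a temporary helper column
                 .filter(col("rn").equalTo(1))               // keep only the top row per group
                 .drop("rn");
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[1]")
                .appName("latest-per-key")
                .getOrCreate();

        // Made-up sample data: column2 holds years as strings.
        StructType schema = new StructType()
                .add("column1", DataTypes.StringType)
                .add("column2", DataTypes.StringType)
                .add("column3", DataTypes.StringType);
        List<Row> rows = Arrays.asList(
                RowFactory.create("a", "2019", "x"),
                RowFactory.create("a", "2020", "y"),
                RowFactory.create("b", "2018", "z"));
        Dataset<Row> ds = spark.createDataFrame(rows, schema);

        // Keeps ("a", "2020", "y") and ("b", "2018", "z").
        latestPerKey(ds).orderBy("column1").show();

        spark.stop();
    }
}
```

If two rows in the same group share the same maximum column2, row_number() keeps one of them with no guaranteed choice; adding a second sort column to the orderBy makes the pick deterministic. If the year strings can differ in width, ordering by col("column2").cast(DataTypes.IntegerType).desc() inside the window restores numeric ordering without adding a permanent extra column.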
