Pratham Solanki

Reputation: 347

filter spark dataframe based on maximum value of a column

I want to do something like this:

df
.withColumn("newCol", <some formula>)
.filter(s"""newCol > ${(math.min(max("newCol").asInstanceOf[Double],10))}""")

Exception I'm getting:

org.apache.spark.sql.Column cannot be cast to java.lang.Double

Can you please suggest a way to achieve what I want?
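(For context: `max("newCol")` is a Column expression, not a Double, which is why the cast fails. A minimal runnable sketch of materializing the aggregate as a plain Scala value first, using toy data and a hypothetical helper name:)

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.max

object MaxFilterSketch {
  // Hypothetical helper: keep rows where `column` exceeds min(max(column), cap).
  def filterAboveCappedMax(df: DataFrame, column: String, cap: Double): DataFrame = {
    import df.sparkSession.implicits._
    // Materialize the aggregate as a real Double, not a Column ...
    val maxVal = df.select(max(df(column))).as[Double].first()
    // ... so that math.min applies to two Doubles.
    df.filter(df(column) > math.min(maxVal, cap))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("max-filter-sketch").getOrCreate()
    import spark.implicits._
    // Toy stand-in for df after .withColumn("newCol", ...)
    val demo = Seq(1.0, 5.0, 20.0, 30.0).toDF("newCol")
    filterAboveCappedMax(demo, "newCol", 10.0).show() // keeps 20.0 and 30.0
    spark.stop()
  }
}
```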

Upvotes: 1

Views: 12598

Answers (4)

HT.

Reputation: 211

The solution has two parts.

Part I
Find the maximum value: df.select(max($"col1")).first()(0)

Part II
Use that value to filter on it:
df.filter($"col1" === df.select(max($"col1")).first()(0)).show

Bonus
To avoid potential casting errors, you can also retrieve the maximum value in the specific type you need, using the .get family of methods on the Row: df.select(max($"col1")).first.getDouble(0)
In this case col1 is DoubleType, so I chose to extract it in the matching type. You can get pretty much any other type as well. The options are:
getBoolean, getClass, getDecimal, getFloat, getJavaMap, getLong, getSeq, getString, getTimestamp, getAs, getByte, getDate, getDouble, getInt, getList, getMap, getShort, getStruct, getValuesMap
The full solution in this case becomes:
df.filter($"col1" === df.select(max($"col1")).first.getDouble(0)).show

Upvotes: 0

Raphael Roth

Reputation: 27373

I assume newCol is already present in df, then:

import org.apache.spark.sql.expressions.Window   
import org.apache.spark.sql.functions._

df
.withColumn("max_newCol",max($"newCol").over(Window.partitionBy()))
.filter($"newCol" > least($"max_newCol", lit(10.0)))

Instead of max($"newCol").over(Window.partitionBy()) you can also just write max($"newCol").over()

Upvotes: 5

Alper t. Turker

Reputation: 35229

I'd separate both steps and either:

val newDF = df
 .withColumn("newCol", <some formula>)

// Spark 2.1 or later
// With 1.x use join
newDF.alias("l").crossJoin(
  newDF.alias("r")).where($"l.newCol" > least($"r.newCol", lit(10.0)))

or

newDF.where(
  $"newCol" > (newDF.select(max($"newCol")).as[Double].first min 10.0))

Upvotes: 0

neilron

Reputation: 176

I think the DataFrame describe function is what you are looking for.

ds.describe("age", "height").show()

// output:  
// summary age   height  
// count   10.0  10.0  
// mean    53.3  178.05  
// stddev  11.6  15.7  
// min     18.0  163.0  
// max     92.0  192.0  
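One caveat worth noting: describe() returns every statistic as a StringType column, so reusing the max in a numeric filter requires parsing it back. A minimal sketch with toy data and a hypothetical helper name:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DescribeMaxSketch {
  // Hypothetical helper: pull the "max" statistic back out as a Double.
  def maxFromDescribe(df: DataFrame, column: String): Double =
    df.describe(column)
      .filter(col("summary") === "max")
      .select(column)
      .first()
      .getString(0) // describe() yields StringType columns
      .toDouble

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("describe-sketch").getOrCreate()
    import spark.implicits._
    val ds = Seq(18.0, 53.0, 92.0).toDF("age")
    println(maxFromDescribe(ds, "age")) // 92.0
    spark.stop()
  }
}
```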

Upvotes: 0
