BdEngineer

Reputation: 3209

How to filter dataframe using two dates?

I have a DataFrame with a data_date column, as shown below:

    root
     |-- data_date: timestamp (nullable = true)

    +-------------------+
    |          data_date|
    +-------------------+
    |2009-10-19 00:00:00|
    |2004-02-24 00:00:00|
    +-------------------+

I need to filter the data between two dates, i.e. data_date between '01-Jan-2017' and '31-Dec-2017'.

I tried several approaches, such as:

    df.where(col("data_date") >= "2017-01-01")
    df.filter(col("data_date").gt("2017-01-01"))
    df.filter(col("data_date").gt(lit("2017-01-01"))).filter(col("data_date").lt("2017-12-31"))

but none of them worked.

I am getting the error below:

    java.lang.AssertionError: assertion failed: unsafe symbol Unstable (child of <none>) in runtime reflection universe
        at scala.reflect.internal.Symbols$Symbol.<init>(Symbols.scala:205)
        at scala.reflect.internal.Symbols$TypeSymbol.<init>(Symbols.scala:3030)
        at scala.reflect.internal.Symbols$ClassSymbol.<init>(Symbols.scala:3222)
        at scala.reflect.internal.Symbols$StubClassSymbol.<init>(Symbols.scala:3522)
        at scala.reflect.internal.Symbols$class.newStubSymbol(Symbols.scala:191)
        at scala.reflect.internal.SymbolTable.newStubSymbol(SymbolTable.scala:16)

How can I solve it?

Upvotes: 0

Views: 88

Answers (1)

stack0114106

Reputation: 8791

You need to cast the literal value to the "date" data type. Also, note that neither of your sample rows falls within the range you are filtering on, so the filter returns no rows for this input. Check this out:

scala> val df = Seq(("2009-10-19 00:00:00"),("2004-02-24 00:00:00")).toDF("data_date").select('data_date.cast("timestamp"))
df: org.apache.spark.sql.DataFrame = [data_date: timestamp]

scala> df.printSchema
root
 |-- data_date: timestamp (nullable = true)


scala> df.withColumn("greater",'data_date.gt(lit("2017-01-01").cast("date"))).withColumn("lesser",'data_date.lt(lit("2017-12-31").cast("date"))).show

+-------------------+-------+------+
|          data_date|greater|lesser|
+-------------------+-------+------+
|2009-10-19 00:00:00|  false|  true|
|2004-02-24 00:00:00|  false|  true|
+-------------------+-------+------+


If I change the input so that the dates fall within 2017, the filter works:

val df = Seq(("2017-10-19 00:00:00"),("2017-02-24 00:00:00")).toDF("data_date").select('data_date.cast("timestamp"))
val df2 = df.withColumn("greater",'data_date.gt(lit("2017-01-01").cast("date"))).withColumn("lesser",'data_date.lt(lit("2017-12-31").cast("date")))
df2.filter("greater and lesser").show(false)

+-------------------+-------+------+
|data_date          |greater|lesser|
+-------------------+-------+------+
|2017-10-19 00:00:00|true   |true  |
|2017-02-24 00:00:00|true   |true  |
+-------------------+-------+------+
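The same filter can be written without the helper columns. Below is a minimal sketch, not a definitive implementation: the object name `FilterByDateRange`, the helper `within2017`, the extra sample row on 31 December, and the choice of a half-open range are all assumptions added for illustration. Note that `gt`/`lt` as used above are strict comparisons, so a row stamped exactly `2017-01-01 00:00:00`, or any time on `2017-12-31`, would be excluded. Comparing against the start of the next year with `<` is one way to keep the whole last day.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

// Hypothetical wrapper object, only here to make the sketch self-contained.
object FilterByDateRange {

  // Keep rows with 2017-01-01 00:00:00 <= data_date < 2018-01-01 00:00:00.
  // Both literals are cast explicitly so the comparison is timestamp vs timestamp.
  def within2017(df: DataFrame): DataFrame =
    df.filter(
      col("data_date") >= lit("2017-01-01").cast("timestamp") &&
      col("data_date") <  lit("2018-01-01").cast("timestamp")
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("filter-by-date-range")
      .getOrCreate()
    import spark.implicits._

    // Sample data (assumed): one row outside 2017, two rows inside it.
    val df = Seq("2009-10-19 00:00:00", "2017-02-24 00:00:00", "2017-12-31 15:30:00")
      .toDF("data_date")
      .select(col("data_date").cast("timestamp"))

    // Keeps the two 2017 rows, including the one on the afternoon of 31 December.
    within2017(df).show(false)

    spark.stop()
  }
}
```

If you really do want both end dates treated as plain dates, `col("data_date").cast("date").between("2017-01-01", "2017-12-31")` is another option, since `between` is inclusive on both ends.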

Upvotes: 1
