user8510536


How to get the months/years difference between two dates in Spark SQL

I am getting the error:

org.apache.spark.sql.AnalysisException: cannot resolve 'year'

My input data:

1,2012-07-21,2014-04-09

My code:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

case class c(id: Int, start: String, end: String)

val c1 = sc.textFile("date.txt")
val c2 = c1.map(_.split(",")).map(r => c(r(0).toInt, r(1), r(2)))
val c3 = c2.toDF()
c3.registerTempTable("c4")
val r = sqlContext.sql("select id, datediff(year, to_date(end), to_date(start)) AS date from c4")

What can I do to resolve the above error?

I have tried the following code, but it gives the output in days and I need it in years:

val r = sqlContext.sql("select id,datediff(to_date(end), to_date(start)) AS date from c4")

Please advise whether I can use a function like to_date to get the difference in years.

Upvotes: 1

Views: 27326

Answers (4)

Franzi

Reputation: 1871

Since datediff only returns the difference in days, I prefer to use my own UDF.

import java.sql.Timestamp
import java.time.Instant
import java.time.temporal.ChronoUnit

import org.apache.spark.sql.functions.{udf, col}
import org.apache.spark.sql.DataFrame

// Difference between two timestamps, measured in the given ChronoUnit
def timeDiff(chronoUnit: ChronoUnit)(dateA: Timestamp, dateB: Timestamp): Long =
  chronoUnit.between(
    Instant.ofEpochMilli(dateA.getTime),
    Instant.ofEpochMilli(dateB.getTime)
  )

// Adds a column `colName` holding the difference between columns `dateA` and `dateB`
def withTimeDiff(dateA: String, dateB: String, colName: String, chronoUnit: ChronoUnit)(df: DataFrame): DataFrame = {
  val timeDiffUDF = udf[Long, Timestamp, Timestamp](timeDiff(chronoUnit))
  df.withColumn(colName, timeDiffUDF(col(dateA), col(dateB)))
}

Then I call it as a DataFrame transformation:

df.transform(withTimeDiff("sleepTime", "wakeupTime", "minutes", ChronoUnit.MINUTES))
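
One caveat for the question's case: ChronoUnit.YEARS and ChronoUnit.MONTHS are not supported on Instant, so the UDF above would throw an UnsupportedTemporalTypeException for those units. A minimal sketch of a variant that handles date-based units, assuming the inputs are java.sql.Timestamp and that UTC is an acceptable zone (both assumptions, not part of the original answer):

import java.sql.Timestamp
import java.time.ZoneOffset
import java.time.temporal.ChronoUnit

// Convert to LocalDate first, since Instant does not support date-based units.
// The UTC zone here is an assumption -- adjust it to match your data.
def dateDiff(chronoUnit: ChronoUnit)(dateA: Timestamp, dateB: Timestamp): Long =
  chronoUnit.between(
    dateA.toInstant.atZone(ZoneOffset.UTC).toLocalDate,
    dateB.toInstant.atZone(ZoneOffset.UTC).toLocalDate
  )

Wrapped in a UDF the same way as timeDiff, this works for ChronoUnit.YEARS and ChronoUnit.MONTHS as well.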

Upvotes: 0

Naveen Nelamali

Reputation: 1164

One of the above answers doesn't return the right year when the two dates are less than 365 days apart. The example below produces the right year value and rounds the month and year differences to 2 decimal places.

Seq(("2019-07-01"),("2019-06-24"),("2019-08-24"),("2018-12-23"),("2018-07-20")).toDF("startDate").select(
col("startDate"),current_date().as("endDate"))
.withColumn("datesDiff", datediff(col("endDate"),col("startDate")))
.withColumn("montsDiff", months_between(col("endDate"),col("startDate")))
.withColumn("montsDiff_round", round(months_between(col("endDate"),col("startDate")),2))
.withColumn("yearsDiff", months_between(col("endDate"),col("startDate"),true).divide(12))
.withColumn("yearsDiff_round", round(months_between(col("endDate"),col("startDate"),true).divide(12),2))
.show()

Outputs:

+----------+----------+---------+-----------+---------------+--------------------+---------------+
| startDate|   endDate|datesDiff|  montsDiff|montsDiff_round|           yearsDiff|yearsDiff_round|
+----------+----------+---------+-----------+---------------+--------------------+---------------+
|2019-07-01|2019-07-24|       23| 0.74193548|           0.74| 0.06182795666666666|           0.06|
|2019-06-24|2019-07-24|       30|        1.0|            1.0| 0.08333333333333333|           0.08|
|2019-08-24|2019-07-24|      -31|       -1.0|           -1.0|-0.08333333333333333|          -0.08|
|2018-12-23|2019-07-24|      213| 7.03225806|           7.03|         0.586021505|           0.59|
|2018-07-20|2019-07-24|      369|12.12903226|          12.13|  1.0107526883333333|           1.01|
+----------+----------+---------+-----------+---------------+--------------------+---------------+

You can find a complete working example at the URL below:

https://sparkbyexamples.com/spark-calculate-difference-between-two-dates-in-days-months-and-years/

Hope this helps.

Happy Learning!!

Upvotes: 4

Rishikesh Teke

Reputation: 408

Another simple way is to cast the string columns to DateType in Spark SQL and apply the SQL date and time functions to them, like the following:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val c4 = c3.select(col("id"), col("start").cast(DateType), col("end").cast(DateType))

c4.withColumn("dateDifference", datediff(col("end"), col("start")))
  .withColumn("monthDifference", months_between(col("end"), col("start")))
  .withColumn("yearDifference", year(col("end")) - year(col("start")))
  .show()
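
Note that year(col("end")) - year(col("start")) counts calendar-year boundaries crossed rather than elapsed years: for 2012-12-31 to 2013-01-01 it returns 1. If full elapsed years are wanted instead, a sketch based on months_between (not part of the original answer) would be:

// floor(months / 12) gives whole years actually elapsed between the two dates
c4.withColumn("fullYears", floor(months_between(col("end"), col("start")) / 12))
  .show()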

Upvotes: 3

hagarwal

Reputation: 1163

val r = sqlContext.sql("select id,datediff(year,to_date(end), to_date(start)) AS date from c4")

In the above code, "year" is not a column in the DataFrame, i.e. it is not a valid column in table "c4". That is why the AnalysisException is thrown: the query is invalid because it cannot resolve "year". Spark SQL's datediff takes only two date arguments and returns the difference in days; it does not accept a unit argument the way SQL Server's DATEDIFF(year, ...) does.
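
If you want to stay in plain SQL, one way (a sketch, assuming the table c4 from the question) is the built-in months_between function divided by 12:

// months_between returns fractional months; dividing by 12 yields fractional years
val r = sqlContext.sql(
  "select id, months_between(to_date(end), to_date(start)) / 12 AS yearsDiff from c4")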

Using a Spark user-defined function (UDF) would be a more robust approach.
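
A minimal sketch of such a UDF using java.time.Period, reusing the DataFrame c3 and columns from the question (the helper name yearsBetween is hypothetical):

import java.time.{LocalDate, Period}
import org.apache.spark.sql.functions.{col, udf}

// Whole calendar years elapsed between two ISO-formatted date strings
val yearsBetween = udf { (start: String, end: String) =>
  Period.between(LocalDate.parse(start), LocalDate.parse(end)).getYears
}

c3.withColumn("yearsDiff", yearsBetween(col("start"), col("end"))).show()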

Upvotes: 1
