Reputation:
I am getting the error:
org.apache.spark.sql.analysisexception: cannot resolve 'year'
My input data:
1,2012-07-21,2014-04-09
My code:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
case class c (id:Int,start:String,end:String)
val c1 = sc.textFile("date.txt")
val c2 = c1.map(_.split(",")).map(r=>(c(r(0).toInt,r(1).toString,r(2).toString)))
val c3 = c2.toDF();
c3.registerTempTable("c4")
val r = sqlContext.sql("select id,datediff(year,to_date(end), to_date(start)) AS date from c4")
What can I do resolve above error?
I have tried the following code but I got the output in days and I need it in years
val r = sqlContext.sql("select id,datediff(to_date(end), to_date(start)) AS date from c4")
Please advise me if i can use any function like to_date to get year difference.
Upvotes: 1
Views: 27326
Reputation: 1871
Since dateDiff
only returns the difference between days. I prefer to use my own UDF.
import java.sql.Timestamp
import java.time.Instant
import java.time.temporal.ChronoUnit
import org.apache.spark.sql.functions.{udf, col}
import org.apache.spark.sql.DataFrame
def timeDiff(chronoUnit: ChronoUnit)(dateA: Timestamp, dateB: Timestamp): Long = {
chronoUnit.between(
Instant.ofEpochMilli(dateA.getTime),
Instant.ofEpochMilli(dateB.getTime)
)
}
def withTimeDiff(dateA: String, dateB: String, colName: String, chronoUnit: ChronoUnit)(df: DataFrame): DataFrame = {
val timeDiffUDF = udf[Long, Timestamp, Timestamp](timeDiff(chronoUnit))
df.withColumn(colName, timeDiffUDF(col(dateA), col(dateB)))
}
Then I call it as a dataframe transformation.
df.transform(withTimeDiff("sleepTime", "wakeupTime", "minutes", ChronoUnit.MINUTES)
Upvotes: 0
Reputation: 1164
One of the above answers doesn't return the right Year when days between two dates less than 365. Below example provides the right year and rounds the month and year to 2 decimal.
Seq(("2019-07-01"),("2019-06-24"),("2019-08-24"),("2018-12-23"),("2018-07-20")).toDF("startDate").select(
col("startDate"),current_date().as("endDate"))
.withColumn("datesDiff", datediff(col("endDate"),col("startDate")))
.withColumn("montsDiff", months_between(col("endDate"),col("startDate")))
.withColumn("montsDiff_round", round(months_between(col("endDate"),col("startDate")),2))
.withColumn("yearsDiff", months_between(col("endDate"),col("startDate"),true).divide(12))
.withColumn("yearsDiff_round", round(months_between(col("endDate"),col("startDate"),true).divide(12),2))
.show()
Outputs:
+----------+----------+---------+-----------+---------------+--------------------+---------------+
| startDate| endDate|datesDiff| montsDiff|montsDiff_round| yearsDiff|yearsDiff_round|
+----------+----------+---------+-----------+---------------+--------------------+---------------+
|2019-07-01|2019-07-24| 23| 0.74193548| 0.74| 0.06182795666666666| 0.06|
|2019-06-24|2019-07-24| 30| 1.0| 1.0| 0.08333333333333333| 0.08|
|2019-08-24|2019-07-24| -31| -1.0| -1.0|-0.08333333333333333| -0.08|
|2018-12-23|2019-07-24| 213| 7.03225806| 7.03| 0.586021505| 0.59|
|2018-07-20|2019-07-24| 369|12.12903226| 12.13| 1.0107526883333333| 1.01|
+----------+----------+---------+-----------+---------------+--------------------+---------------+
You can find a complete working example at below URL
https://sparkbyexamples.com/spark-calculate-difference-between-two-dates-in-days-months-and-years/
Hope this helps.
Happy Learning !!
Upvotes: 4
Reputation: 408
Another simple way to cast the string to dateType in spark sql and apply sql dates and time functions on the columns like following :
import org.apache.spark.sql.types._
val c4 = c3.select(col("id"),col("start").cast(DateType),col("end").cast(DateType))
c4.withColumn("dateDifference", datediff(col("end"),col("start")))
.withColumn("monthDifference", months_between(col("end"),col("start")))
.withColumn("yearDifference", year(col("end"))-year(col("start")))
.show()
Upvotes: 3
Reputation: 1163
val r = sqlContext.sql("select id,datediff(year,to_date(end), to_date(start)) AS date from c4")
In the above code, "year" is not a column in the data frame i.e it is not a valid column in table "c4" that is why analysis exception is thrown as query is invalid, query is not able to find the "year" column.
Use Spark User Defined Function (UDF), that will be a more robust approach.
Upvotes: 1