Reputation: 5389
I am using Spark 1.3. I want to do some date-based calculations. In the following dataset, for each unique id, I want to get the record where beging_date is maximum (the latest record).
Also, when I read the data from file, should I cast it as TimestampType (import org.apache.spark.sql.types.TimestampType)?
Here is some sample data:
id beging_date end_date
1 1/1/2016 20:06:00.0 1/4/2016 20:06:00.0
2 1/5/2013 20:06:00.0 1/8/2016 20:06:00.0
1 1/6/2013 20:06:00.0 1/18/2016 20:06:00.0
3 2/1/2013 20:06:00.0 2/5/2016 20:06:00.0
1 1/20/2013 20:06:00.0 2/4/2016 20:06:00.0
3 3/5/2013 20:06:00.0 3/8/2016 20:06:00.0
Here is the desired output:
id beging_date end_date
1 1/20/2013 20:06:00.0 2/4/2016 20:06:00.0
2 1/5/2013 20:06:00.0 1/8/2016 20:06:00.0
3 3/5/2013 20:06:00.0 3/8/2016 20:06:00.0
Upvotes: 0
Views: 1991
Reputation: 13927
When dealing with dates, I think it's important to separately talk about how the data is stored on disk or serialized and how it is stored in memory as objects.
Your dates are serialized as Strings. A string representation is a pretty poor representation; it's probably my 4th choice behind TimestampType
, LongType
, and even DoubleType
. It's always a pain to parse Strings. On the other hand -- your dates are already serialized as Strings, do you need / want to change that?
I would consider leaving the original column intact and creating a new column that would be more conducive to your data munging.
Which leads me to my next point -- how you want to represent the date as an object in memory completely turns on how you want to use it. For an operation like, "find the max value", your simplest might be to convert it to a LongType
-- the number of milliseconds since Jan 1, 1970, for example. Pretty much every date-related object and function can either ingest or spit out a unix
timestamp, so they are easy to convert back and forth.
To convert your strings to a TimestampType
, you need to convert them to a java.sql.Timestamp
. I'll leave it to you figure out the exact formatting, but you want something like:
import java.sql.Timestamp
import java.text.SimpleDateFormat
val formatter = new SimpleDateFormat(...)
val millis_since_1_1_1970 = formatter.parse("1/5/2013 20:06:00.0").getTime
val timestamp = new java.sql.Timestamp(millis_since_1_1_1970)
Like I said, you could just stop after millis_since_1_1_1970
and use that Long
value as your comparison value. For your purposes, it would work
Upvotes: 1