user3803714
user3803714

Reputation: 5389

SparkSQL datetime function

I am using Spark 1.3. I want to do some date-based calculations. In the following dataset, for each unique id, I want to get the record where beging_date is maximum (the latest record).

Also, when I read the data from file, should I cast it as TimestampType (import org.apache.spark.sql.types.TimestampType)?

Here is some sample data:

id beging_date end_date

1 1/1/2016 20:06:00.0 1/4/2016 20:06:00.0

2 1/5/2013 20:06:00.0 1/8/2016 20:06:00.0

1 1/6/2013 20:06:00.0 1/18/2016 20:06:00.0

3 2/1/2013 20:06:00.0 2/5/2016 20:06:00.0

1 1/20/2013 20:06:00.0 2/4/2016 20:06:00.0

3 3/5/2013 20:06:00.0 3/8/2016 20:06:00.0

Here is the desired output:

id beging_date end_date

1 1/20/2013 20:06:00.0 2/4/2016 20:06:00.0

2 1/5/2013 20:06:00.0 1/8/2016 20:06:00.0

3 3/5/2013 20:06:00.0 3/8/2016 20:06:00.0

Upvotes: 0

Views: 1991

Answers (1)

David Griffin
David Griffin

Reputation: 13927

When dealing with dates, I think it's important to separately talk about how the data is stored on disk or serialized and how it is stored in memory as objects.

Your dates are serialized as Strings. A string representation is a pretty poor representation; it's probably my 4th choice behind TimestampType, LongType, and even DoubleType. It's always a pain to parse Strings. On the other hand -- your dates are already serialized as Strings, do you need / want to change that?

I would consider leaving the original column intact and creating a new column that would be more conducive to your data munging.

Which leads me to my next point -- how you want to represent the date as an object in memory completely turns on how you want to use it. For an operation like, "find the max value", your simplest might be to convert it to a LongType -- the number of milliseconds since Jan 1, 1970, for example. Pretty much every date-related object and function can either ingest or spit out a unix timestamp, so they are easy to convert back and forth.

To convert your strings to a TimestampType, you need to convert them to a java.sql.Timestamp. I'll leave it to you figure out the exact formatting, but you want something like:

import java.sql.Timestamp
import java.text.SimpleDateFormat

val formatter = new SimpleDateFormat(...)
val millis_since_1_1_1970 = formatter.parse("1/5/2013 20:06:00.0").getTime
val timestamp = new java.sql.Timestamp(millis_since_1_1_1970)

Like I said, you could just stop after millis_since_1_1_1970 and use that Long value as your comparison value. For your purposes, it would work

Upvotes: 1

Related Questions