Reputation: 61
I want to use a date column as a feature in linear regression, so I have to convert it to a number. The lowest date should map to 0, and later dates should increase according to their difference from it.
I am using Scala with Spark MLlib, and I have a DataFrame ready with several fields, including a date. For example:
| date | id |
| 01-01-2017 | 12 |
| 01-02-2016 | 13 |
| 05-05-2016 | 22 |
For string columns, I have used one-hot encoding. But for the date, how can I set the first date to 0 and then increase the number according to the date difference? Thanks.
Upvotes: 0
Views: 1555
Reputation: 35249
This depends entirely on the model you want to create. For very basic trend modeling you can just convert your date to a Unix timestamp:
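import org.apache.spark.sql.functions._
import spark.implicits._ // enables the $"colName" syntax; assumes a SparkSession named spark
// Parse the dd-MM-yyyy strings into seconds since the Unix epoch
val parsed = df.withColumn("date", unix_timestamp($"date", "dd-MM-yyyy"))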
No additional processing should be necessary, but you can of course shift it to start at 0, or rescale it to a more convenient unit such as days.
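If you do want the earliest date at 0, one possible sketch is to parse the column into a date, find the minimum, and take the difference in days. This assumes Spark 2.2 or later (for `to_date` with a format argument) and a local `SparkSession`; the column names `parsed` and `days_since_start` are made up for illustration, and the sample rows are copied from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Assumption: a local session for the sketch; in spark-shell one already exists
val spark = SparkSession.builder().master("local[*]").appName("date-offset").getOrCreate()
import spark.implicits._

// Sample rows from the question
val df = Seq(("01-01-2017", 12), ("01-02-2016", 13), ("05-05-2016", 22)).toDF("date", "id")

// Parse the dd-MM-yyyy strings into a DateType column
val withDate = df.withColumn("parsed", to_date($"date", "dd-MM-yyyy"))

// Earliest date in the column, collected to the driver
val minDate = withDate.agg(min($"parsed")).head.getDate(0)

// 0 for the earliest row, then one unit per calendar day after it
val shifted = withDate.withColumn("days_since_start", datediff($"parsed", lit(minDate)))
```

For the three sample rows, `days_since_start` should come out as 335, 0 and 94. Collecting the minimum to the driver keeps the sketch simple; it costs one extra pass over the data.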
More advanced modeling would include extracting different components such as month or dayofweek. These should in general be treated as categorical variables and one-hot encoded.
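A rough sketch of that second approach, assuming Spark 3.x, where `OneHotEncoder` is an estimator that takes `setInputCols` / `setOutputCols` (in 2.3 and 2.4 the equivalent class is `OneHotEncoderEstimator`, and `dayofweek` needs 2.3 or later). The output column names `month_vec` and `dayofweek_vec` are made up for illustration.

```scala
import org.apache.spark.ml.feature.OneHotEncoder
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Assumption: a local session for the sketch; in spark-shell one already exists
val spark = SparkSession.builder().master("local[*]").appName("date-components").getOrCreate()
import spark.implicits._

// Sample rows from the question
val df = Seq(("01-01-2017", 12), ("01-02-2016", 13), ("05-05-2016", 22)).toDF("date", "id")

// Derive integer calendar components from the parsed date
val components = df
  .withColumn("parsed", to_date($"date", "dd-MM-yyyy"))
  .withColumn("month", month($"parsed"))          // 1 = January ... 12 = December
  .withColumn("dayofweek", dayofweek($"parsed"))  // 1 = Sunday ... 7 = Saturday

// Treat both components as categorical and encode them as sparse indicator vectors
val encoder = new OneHotEncoder()
  .setInputCols(Array("month", "dayofweek"))
  .setOutputCols(Array("month_vec", "dayofweek_vec"))

val encoded = encoder.fit(components).transform(components)
```

The resulting vector columns can then be combined with other features (for example with `VectorAssembler`) before fitting the regression. As far as I know the encoder sizes its vectors from the values seen during `fit`, so it is safer to fit on data that covers the full range of months and weekdays.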
Upvotes: 1