Interstellar

Reputation: 61

Using a date in linear regression and converting dates to numbers with Spark MLlib

I want to use a date as a feature in linear regression, so I have to convert it to a number. The lowest date should map to 0, and every other date should map to a number that grows with its difference from that lowest date.

Then I can use the date field in linear regression with Scala and Spark MLlib. I have a DataFrame ready with several fields, including a date. For example:

| date       | id |
| 01-01-2017 | 12 |
| 01-02-2016 | 13 |
| 05-05-2016 | 22 |

For string columns I have used one-hot encoding. But for a date, how can I set the first date to 0 and then increase the number according to the difference from it? Thanks.

Upvotes: 0

Views: 1555

Answers (1)

Alper t. Turker

Reputation: 35249

This depends entirely on the model you want to create. For very basic trend modeling you can just convert your dates to Unix timestamps:

import org.apache.spark.sql.functions._

// unix_timestamp parses the dd-MM-yyyy string and returns seconds since the Unix epoch
val parsed = df.withColumn("date", unix_timestamp(col("date"), "dd-MM-yyyy"))

No additional processing should be necessary, but you can of course shift it to start at 0, or rescale it to a more convenient scale.
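If you do want the earliest date to map to 0, as the question asks, one way is to count days from the minimum date. The following is only a sketch: it swaps the timestamp in seconds for a day difference computed with `datediff`, assumes Spark 2.2 or later (for `to_date` with a format string), and the names `DateOffset`, `withDaysSinceStart`, and `days_since_start` are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, datediff, lit, min, to_date}

object DateOffset {
  // Hypothetical helper: adds "days_since_start", which is 0 for the earliest
  // date and the number of days after it for every other row.
  def withDaysSinceStart(df: DataFrame, dateCol: String = "date"): DataFrame = {
    // Parse the dd-MM-yyyy strings into a DateType column.
    val parsed = df.withColumn(dateCol, to_date(col(dateCol), "dd-MM-yyyy"))
    // Fetch the earliest date once, on the driver.
    val minDate = parsed.agg(min(col(dateCol))).head().getDate(0)
    // datediff(end, start) returns the number of days from start to end.
    parsed.withColumn("days_since_start", datediff(col(dateCol), lit(minDate)))
  }

  def main(args: Array[String]): Unit = {
    // Local session for demonstration only.
    val spark = SparkSession.builder().master("local[*]").appName("date-offset").getOrCreate()
    import spark.implicits._

    // Sample rows from the question.
    val df = Seq(("01-01-2017", 12), ("01-02-2016", 13), ("05-05-2016", 22)).toDF("date", "id")
    withDaysSinceStart(df).show()
    spark.stop()
  }
}
```

With the sample rows, the earliest date is 01-02-2016 (1 February 2016), so it gets 0 and the other two rows get their distance from it in days. The resulting integer column can then go into a `VectorAssembler` like any other numeric feature.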

More advanced modeling would include extracting individual components such as the month or the day of week. These should in general be treated as categorical variables and one-hot encoded.
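Extracting those components could look roughly like this. It is a sketch under assumptions: `dayofweek` needs Spark 2.3 or later, and `DateComponents`, `withDateParts`, and the output column names are hypothetical. The one-hot encoding step is left out because the encoder class and its setters differ between Spark versions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, dayofweek, month, to_date}

object DateComponents {
  // Hypothetical helper: adds calendar components of a dd-MM-yyyy string column.
  def withDateParts(df: DataFrame, dateCol: String = "date"): DataFrame = {
    val parsed = to_date(col(dateCol), "dd-MM-yyyy")
    df.withColumn("month", month(parsed))            // 1 = January ... 12 = December
      .withColumn("day_of_week", dayofweek(parsed))  // 1 = Sunday ... 7 = Saturday
  }

  def main(args: Array[String]): Unit = {
    // Local session for demonstration only.
    val spark = SparkSession.builder().master("local[*]").appName("date-parts").getOrCreate()
    import spark.implicits._

    // Sample rows from the question.
    val df = Seq(("01-01-2017", 12), ("01-02-2016", 13), ("05-05-2016", 22)).toDF("date", "id")
    withDateParts(df).show()
    spark.stop()
  }
}
```

The `month` and `day_of_week` columns are integer codes, not quantities, so they would be passed through a one-hot encoder before being assembled into the feature vector, just like the string column mentioned in the question.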

Upvotes: 1
