Reputation: 1293
I need to find the difference between two dates in PySpark, but mimicking the behavior of the SAS intck function.
I have tabulated the differences below.
import datetime

import pyspark.sql.functions as F

# 'spark' is an existing SparkSession (e.g. pyspark shell or notebook).
ref_date = '2023-02-24'
data = [
    (1, datetime.date(2023, 1, 23), 1),
    (2, datetime.date(2023, 1, 24), 1),
    (3, datetime.date(2023, 1, 30), 1),
    (4, datetime.date(2022, 11, 30), 3),
    (5, datetime.date(2022, 11, 11), 3),
]
cols = ['id', 'dt', 'SAS_months_diff']
df = spark.createDataFrame(data, cols)
# Truncated month count vs. the raw fractional value returned by months_between.
df = df.withColumn('spark_months_diff',
                   F.months_between(F.lit(ref_date), F.col('dt')).cast('integer'))
df = df.withColumn('spark_months_diff_x',
                   F.months_between(F.lit(ref_date), F.col('dt')))
df.show()
+---+----------+---------------+-----------------+-------------------+
| id| dt|SAS_months_diff|spark_months_diff|spark_months_diff_x|
+---+----------+---------------+-----------------+-------------------+
| 1|2023-01-23| 1| 1| 1.03225806|
| 2|2023-01-24| 1| 1| 1.0|
| 3|2023-01-30| 1| 0| 0.80645161|
| 4|2022-11-30| 3| 2| 2.80645161|
| 5|2022-11-11| 3| 3| 3.41935484|
+---+----------+---------------+-----------------+-------------------+
The problem is rows id = 3 and id = 4: what SAS calculates as the months between the two dates differs from PySpark. SAS intck (with its default DISCRETE method) counts the month boundaries crossed between the dates, so only the year and month matter, whereas months_between also compares the day of month (the fractional part is roughly days/31), and casting to integer then drops a month whenever the day in dt falls after the day in the reference date.
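For intuition, here is a minimal plain-Python sketch of that boundary-counting rule (my reading of intck's DISCRETE method, not actual SAS code); it reproduces the SAS_months_diff column above:
import datetime

def intck_month(start: datetime.date, end: datetime.date) -> int:
    # Count first-of-month boundaries crossed: only year and month matter.
    return (end.year - start.year) * 12 + (end.month - start.month)

ref = datetime.date(2023, 2, 24)
for d in [datetime.date(2023, 1, 30), datetime.date(2022, 11, 30)]:
    print(d, intck_month(d, ref))  # prints 1 and 3, matching SAS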
I came up with this hack: use the Spark month count when the day of the month is <= the reference day (24), otherwise add one.
day = int(ref_date.split('-')[2])  # reference day of month; compare as int, not str
df = df.withColumn(
    'new_month',
    F.when(F.dayofmonth('dt') <= day,
           F.months_between(F.lit(ref_date), F.col('dt')).cast('integer'))
     .otherwise(F.months_between(F.lit(ref_date), F.col('dt')).cast('integer') + 1))
df.select('dt', 'SAS_months_diff', 'new_month').show()
+----------+---------------+---------+
| dt|SAS_months_diff|new_month|
+----------+---------------+---------+
|2023-01-23| 1| 1|
|2023-01-24| 1| 1|
|2023-01-30| 1| 1|
|2022-11-30| 3| 3|
|2022-11-11| 3| 3|
+----------+---------------+---------+
Is there a better solution?
Upvotes: 2
Views: 984
Reputation: 76
We can get the year difference and the month difference, and then simulate intck as:
year_diff * 12 + month_diff
For example, for dt = 2022-11-30 against ref_date = 2023-02-24 this gives (2023 - 2022) * 12 + (2 - 11) = 3, matching SAS.
Try the following and see if it works for you:
import datetime

import pyspark.sql.functions as F

ref_date = '2023-02-24'
data = [
    (1, datetime.date(2023, 1, 23), 1),
    (2, datetime.date(2023, 1, 24), 1),
    (3, datetime.date(2023, 1, 30), 1),
    (4, datetime.date(2022, 11, 30), 3),
    (5, datetime.date(2022, 11, 11), 3),
]
cols = ['id', 'dt', 'SAS_months_diff']
df = spark.createDataFrame(data, cols)
df = df.withColumn('spark_months_diff',
                   F.months_between(F.lit(ref_date), F.col('dt')).cast('integer'))
df = df.withColumn('spark_months_diff_x',
                   F.months_between(F.lit(ref_date), F.col('dt')))
# intck-style count: only year and month matter, never the day of month.
df = df.withColumn('year_diff', F.year(F.lit(ref_date)) - F.year(F.col('dt')))
df = df.withColumn('month_diff', F.month(F.lit(ref_date)) - F.month(F.col('dt')))
df = df.withColumn('new_spark_months_diff', F.expr('year_diff * 12 + month_diff'))
df.show()
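If you don't need the helper columns, the same idea fits in one expression (a sketch of the same approach, not a different method):
df = df.withColumn(
    'new_spark_months_diff',
    (F.year(F.lit(ref_date)) - F.year('dt')) * 12
    + (F.month(F.lit(ref_date)) - F.month('dt')))
Equivalently, F.months_between(F.trunc(F.lit(ref_date), 'MM'), F.trunc('dt', 'MM')).cast('integer') counts the same month boundaries. For the sample data all of these give 1, 1, 1, 3, 3, matching SAS_months_diff on every row.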
Upvotes: 0