Chuck

Reputation: 1293

How can I get the simple difference in months between two PySpark dates (in the same manner as SAS intck)?

I need to find the difference in months between two dates in PySpark, mimicking the behavior of the SAS intck function.

I tabulated the difference below.

import pyspark.sql.functions as F
import datetime

ref_date = '2023-02-24'

# Sample rows: (id, date, month difference as computed by SAS intck)
Data = [
    (1, datetime.date(2023, 1, 23), 1),
    (2, datetime.date(2023, 1, 24), 1),
    (3, datetime.date(2023, 1, 30), 1),
    (4, datetime.date(2022, 11, 30), 3),
    (5, datetime.date(2022, 11, 11), 3)
]

col = ['id', 'dt', 'SAS_months_diff']

df = spark.createDataFrame(Data, col)

# months_between returns a fractional month count; cast('integer') truncates it
df = df.withColumn('spark_months_diff', F.months_between(F.lit(ref_date), F.col('dt')).cast('integer'))
df = df.withColumn('spark_months_diff_x', F.months_between(F.lit(ref_date), F.col('dt')))

df.show()


+---+----------+---------------+-----------------+-------------------+
| id|        dt|SAS_months_diff|spark_months_diff|spark_months_diff_x|
+---+----------+---------------+-----------------+-------------------+
|  1|2023-01-23|              1|                1|         1.03225806|
|  2|2023-01-24|              1|                1|                1.0|
|  3|2023-01-30|              1|                0|         0.80645161|
|  4|2022-11-30|              3|                2|         2.80645161|
|  5|2022-11-11|              3|                3|         3.41935484|
+---+----------+---------------+-----------------+-------------------+

The problem is rows id = 3 and 4: what SAS calculates as the months between the two dates differs from PySpark. SAS intck counts month boundaries crossed (its default DISCRETE method), while months_between compares the day of the month, so the results diverge when the day of dt falls after the day of ref_date.
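For reference, a minimal pure-Python sketch of what intck('month', start, end) computes by default (the DISCRETE method counts month boundaries crossed; the helper name intck_month is mine, not SAS's):

import datetime

def intck_month(start: datetime.date, end: datetime.date) -> int:
    # Count the month boundaries crossed between start and end,
    # ignoring the day of the month entirely.
    return (end.year - start.year) * 12 + (end.month - start.month)

print(intck_month(datetime.date(2023, 1, 30), datetime.date(2023, 2, 24)))   # 1
print(intck_month(datetime.date(2022, 11, 30), datetime.date(2023, 2, 24)))  # 3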

I came up with this hack: use the Spark month count where the day of dt is <= the day of ref_date (24), otherwise add one.

day = int(ref_date.split('-')[2])  # day of month of the reference date

df = df.withColumn(
    'new_month',
    F.when(
        F.dayofmonth('dt') <= day,
        F.months_between(F.lit(ref_date), F.col('dt')).cast('integer')
    ).otherwise(
        F.months_between(F.lit(ref_date), F.col('dt')).cast('integer') + 1
    )
)


df.select('dt', 'SAS_months_diff', 'new_month').show()

+----------+---------------+---------+
|        dt|SAS_months_diff|new_month|
+----------+---------------+---------+
|2023-01-23|              1|        1|
|2023-01-24|              1|        1|
|2023-01-30|              1|        1|
|2022-11-30|              3|        3|
|2022-11-11|              3|        3|
+----------+---------------+---------+

Is there a better solution?

Upvotes: 2

Views: 984

Answers (1)

Yousry

Reputation: 76

We can get the year difference and the month difference, then simulate intck as:

year_diff * 12 + month_diff

This counts the month boundaries crossed between the two dates, regardless of the day of the month, which is what intck does by default (the DISCRETE method).

Try the following and see if it works for you:

import pyspark.sql.functions as F
import datetime

ref_date = '2023-02-24'

Data = [
    (1, datetime.date(2023, 1, 23), 1),
    (2, datetime.date(2023, 1, 24), 1),
    (3, datetime.date(2023, 1, 30), 1),
    (4, datetime.date(2022, 11, 30), 3),
    (5, datetime.date(2022, 11, 11), 3)
]

col = ['id', 'dt', 'SAS_months_diff']

df = spark.createDataFrame(Data, col)

df = df.withColumn('spark_months_diff', F.months_between(F.lit(ref_date), F.col('dt')).cast('integer'))
df = df.withColumn('spark_months_diff_x', F.months_between(F.lit(ref_date), F.col('dt')))
# Year and month differences between the reference date and dt
df = df.withColumn('year_diff', F.year(F.lit(ref_date)) - F.year(F.col('dt')))
df = df.withColumn('month_diff', F.month(F.lit(ref_date)) - F.month(F.col('dt')))
# intck-style result: month boundaries crossed, independent of the day of month
df = df.withColumn('new_spark_months_diff', F.expr('year_diff * 12 + month_diff'))
df.show()
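For the sample data, new_spark_months_diff matches SAS_months_diff in every row (1, 1, 1, 3, 3). If you prefer to skip the intermediate columns, the same logic can be written in one step (a sketch of the same computation, not a different method):

df = df.withColumn(
    'new_spark_months_diff',
    (F.year(F.lit(ref_date)) - F.year('dt')) * 12
    + (F.month(F.lit(ref_date)) - F.month('dt'))
)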

Upvotes: 0
