Reputation: 65
I'm trying to add a new column to a pyspark df by substracting the values of two existing columns.
I already had a date_of_birth
column available, so I inserted a current_date
column with the following code:
import datetime
currentdate = "14-12-2021"
day,month,year = currentdate.split('-')
today = datetime.date(int(year),int(month),int(day))
df= df.withColumn("current_date", lit(today))
Displaying my df confirms that it worked. Looks a little something like this:
id | date_of_birth | current_date |
---|---|---|
01 | 1995-01-01 | 2021-12-2021 |
02 | 1987-02-16 | 2021-12-2021 |
I inserted the age
column by substracting the values of date_of_birth
and current_date
.
df = df.withColumn('age', (df['current_date'] - df['date_of_birth ']))
Cell runs without a problem.
Here's where I'm stuck:
Once I try to display my dataframe again in order to verify that everything went smoothly, the following error occurs:
'could not parse datatype: interval'
I used df.types() to check what's happening, and apparently my newly inserted age
column is of interval type.
How can I fix this?
Is there a way to display the age in years (int) in this particular scenario?
PS: both the date_of_birth
and current_date
cols have a date type.
Upvotes: 3
Views: 4778
Reputation: 124
Would use one of the pyspark functions for calculating difference between dates.
pyspark.sql.functions.datediff
https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.functions.datediff.html
pyspark.sql.functions.months_between
Upvotes: 2
Reputation: 65
Solved it. Mike's comment helped tons. Thank you!
Here's how I solved it:
# insert new column current_date with dummy data (in this case, 1s)
df = df.withColumn("current_date", lit(1))
# update data with current_date() function
df = df .withColumn("current_date", f.current_date())
# insert new column age with dummy data (in this case, 1s)
df = df .withColumn("age", lit(1))
# update data with months_between() function, divide by 12 to obtain years.
df = df .withColumn("age", f.months_between(df.current_date, df .date_of_birth)/12)
# round and cast as interger to get rid of decimals
df = df .withColumn("age", f.round(df["age"]).cast('integer'))
Upvotes: 3