Pyspark: how to fix 'could not parse datatype: interval' error

Question

I'm trying to add a new column to a pyspark df by substracting the values of two existing columns.

I already had a date_of_birth column available, so I inserted a current_date column with the following code:

import datetime
currentdate = "14-12-2021"
day,month,year = currentdate.split('-')
today = datetime.date(int(year),int(month),int(day))

df= df.withColumn("current_date", lit(today))

Displaying my df confirms that it worked. Looks a little something like this:

id	date_of_birth	current_date
01	1995-01-01	2021-12-2021
02	1987-02-16	2021-12-2021

I inserted the age column by substracting the values of date_of_birth and current_date.

df = df.withColumn('age', (df['current_date'] - df['date_of_birth ']))

Cell runs without a problem.

Here's where I'm stuck:

Once I try to display my dataframe again in order to verify that everything went smoothly, the following error occurs:

'could not parse datatype: interval'

I used df.types() to check what's happening, and apparently my newly inserted age column is of interval type.

How can I fix this?

Is there a way to display the age in years (int) in this particular scenario?

PS: both the date_of_birth and current_date cols have a date type.

calle · Accepted Answer

Solved it. Mike's comment helped tons. Thank you!

Here's how I solved it:


# insert new column current_date with dummy data (in this case, 1s)
df = df.withColumn("current_date", lit(1))

# update data with current_date() function
df  = df .withColumn("current_date", f.current_date())

# insert new column age with dummy data (in this case, 1s)
df  = df .withColumn("age", lit(1))

# update data with months_between() function, divide by 12 to obtain years.
df  = df  .withColumn("age", f.months_between(df.current_date, df  .date_of_birth)/12)

# round and cast as interger to get rid of decimals
df  = df  .withColumn("age", f.round(df["age"]).cast('integer'))

Pyspark: how to fix 'could not parse datatype: interval' error

Answers (2)

Related Questions

Pyspark: how to fix &#39;could not parse datatype: interval&#39; error

Answers (2)

Related Questions

Pyspark: how to fix 'could not parse datatype: interval' error