Date difference in years in PySpark dataframe

I come from a Pandas background and am new to Spark. I have a dataframe with the columns id, age, and dob. I want to get each user's age from their dob (in some rows the age column is NULL).

+----+------+----------+
| id | age  |   dob    |
+----+------+----------+
|  1 | 24   | NULL     |
|  2 | 25   | NULL     |
|  3 | NULL | 1/1/1973 |
|  4 | NULL | 6/6/1980 |
|  5 | 46   |          |
|  6 | NULL | 1/1/1971 |
+----+------+----------+

I want a new column which will calculate age from dob and current date.

I tried this, but I am not getting any results from it:

df.withColumn("diff", 
              datediff(to_date(lit("01-06-2020")),
                       to_date(unix_timestamp('dob', "dd-MM-yyyy").cast("timestamp")))).show()

Upvotes: 4

Answers (3)

ZygD

Reputation: 24478

Using months_between like in this answer, but in a different approach:

in my table, I don't have an 'age' column yet;
to truncate to full years I use .cast('int').

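from pyspark.sql import functions as F
# Assumes 'dob' is already a date column; if it is a string, parse it first, e.g. F.to_date('dob', 'M/d/yyyy')
df = df.withColumn('age', (F.months_between(F.current_date(), F.col('dob')) / 12).cast('int'))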

If the system date is in UTC and your local time zone is different, a separate date function may be needed:

from pyspark.sql import functions as F
def current_local_date():
    return F.from_utc_timestamp(F.current_timestamp(), 'Europe/Riga').cast('date')
df = df.withColumn('age', (F.months_between(current_local_date(), F.col('dob')) / 12).cast('int'))
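
To illustrate why the local date can matter, here is a minimal plain-Python sketch (no Spark needed). It assumes a fixed UTC+3 offset as a stand-in for Europe/Riga in summer, and the instant and the helper name local_date are made up for illustration. Late in the evening UTC, the local calendar date has already moved on, so an age computed from the UTC date could be one day behind.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC+3 offset stands in for Europe/Riga summer time.
LOCAL_TZ = timezone(timedelta(hours=3))

def local_date(utc_instant, tz=LOCAL_TZ):
    """Return the calendar date of a UTC instant as seen in the given time zone."""
    return utc_instant.astimezone(tz).date()

# Hypothetical instant: 22:30 UTC on 31 May 2020.
utc_instant = datetime(2020, 5, 31, 22, 30, tzinfo=timezone.utc)

utc_date = utc_instant.date()        # date according to the UTC clock
riga_date = local_date(utc_instant)  # date according to the local clock

print(utc_date.isoformat(), riga_date.isoformat())
```

For someone whose birthday is 1 June, the two dates would give ages that differ by one year on that evening.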

Upvotes: 3

Som

Reputation: 6338

I believe it is more appropriate to use months_between when you need a difference in years. Use datediff only when you need the difference in days.

Approach-

 val data =
      """
        | id | age  |   dob
        |  1 | 24   |
        |  2 | 25   |
        |  3 |      | 1/1/1973
        |  4 |      | 6/6/1980
        |  5 | 46   |
        |  6 |      | 1/1/1971
      """.stripMargin

    val stringDS = data.split(System.lineSeparator())
      .map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
      .toSeq.toDS()
    val df = spark.read
      .option("sep", ",")
      .option("inferSchema", "true")
      .option("header", "true")
      .csv(stringDS)
    df.show(false)
    df.printSchema()

    /**
      * +---+----+--------+
      * |id |age |dob     |
      * +---+----+--------+
      * |1  |24  |null    |
      * |2  |25  |null    |
      * |3  |null|1/1/1973|
      * |4  |null|6/6/1980|
      * |5  |46  |null    |
      * |6  |null|1/1/1971|
      * +---+----+--------+
      *
      * root
      * |-- id: integer (nullable = true)
      * |-- age: integer (nullable = true)
      * |-- dob: string (nullable = true)
      */

Find age

  df.withColumn("diff",
      coalesce(col("age"),
      round(months_between(current_date(),to_date(col("dob"), "d/M/yyyy"),true).divide(12),2)
      )
     ).show()

    /**
      * +---+----+--------+-----+
      * | id| age|     dob| diff|
      * +---+----+--------+-----+
      * |  1|  24|    null| 24.0|
      * |  2|  25|    null| 25.0|
      * |  3|null|1/1/1973|47.42|
      * |  4|null|6/6/1980|39.99|
      * |  5|  46|    null| 46.0|
      * |  6|null|1/1/1971|49.42|
      * +---+----+--------+-----+
      */

Round it to 0 decimal places if you want the age as a whole number.
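
As a quick check of what a whole-number age looks like, here is a small plain-Python sketch using the fractional diff values from the output above. Python's built-in round stands in for Spark's round here (they agree on these values), and math.floor is shown for comparison as an alternative, not as part of the answer's code.

```python
import math

# The fractional 'diff' values computed from dob in the output above.
fractional_ages = [47.42, 39.99, 49.42]

# Rounding to the nearest whole number.
rounded = [round(a) for a in fractional_ages]

# Truncating to completed years.
floored = [math.floor(a) for a in fractional_ages]

print(rounded)  # [47, 40, 49]
print(floored)  # [47, 39, 49]
```

Rounding reports 40 for the user born on 6/6/1980 a few days before the 40th birthday, while flooring reports 39, so which one to use depends on whether you want the nearest year or completed years.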

Upvotes: 4

ernest_k

Reputation: 45339

You need to compute the date difference and convert the result to years, something like this:

from pyspark.sql.functions import col, current_date, datediff, floor, to_date, when

df.withColumn('diff', 
    when(col('age').isNull(), 
         floor(datediff(current_date(), to_date(col('dob'), 'M/d/yyyy'))/365.25))\
  .otherwise(col('age'))).show()

Which produces:

+---+----+--------+----+
| id| age|     dob|diff|
+---+----+--------+----+
|  1|  24|    null|  24|
|  2|  25|    null|  25|
|  3|null|1/1/1973|  47|
|  4|null|6/6/1980|  39|
|  5|  46|    null|  46|
|  6|null|1/1/1971|  49|
+---+----+--------+----+

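It preserves the age column where it is not null and computes the difference (in days) between dob and today where age is null. The result is then converted to years by dividing by 365.25 and floored. Dividing by 365.25 is an approximation, so you may want to confirm that it is accurate enough for your data, particularly for dates that fall on or near a birthday.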
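
To check the 365.25 conversion outside Spark, here is a minimal plain-Python sketch. The function names and the fixed reference date are assumptions for illustration. It compares the days-based approximation with a calendar-based age that subtracts one year when the birthday has not yet occurred in the current year.

```python
import math
from datetime import date

def age_by_days(dob, today):
    """Approximate age: elapsed days divided by 365.25, then floored."""
    return math.floor((today - dob).days / 365.25)

def age_by_calendar(dob, today):
    """Calendar age: year difference, minus one if the birthday is still ahead this year."""
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return today.year - dob.year - (0 if had_birthday else 1)

# Assumed fixed reference date so the results are reproducible.
today = date(2020, 6, 1)

# The three dob values from the question, read as month/day/year.
for dob in (date(1973, 1, 1), date(1980, 6, 6), date(1971, 1, 1)):
    print(dob, age_by_days(dob, today), age_by_calendar(dob, today))

# Edge case: exactly on the first birthday, 365 days have elapsed.
print(age_by_days(date(2001, 6, 1), date(2002, 6, 1)))      # 0
print(age_by_calendar(date(2001, 6, 1), date(2002, 6, 1)))  # 1
```

For the sample rows both methods agree, but on the birthday itself the days-based version can still report the previous age, because 365 / 365.25 is slightly less than 1.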

Upvotes: 7
