Lizou

Reputation: 883

Pyspark split date string

I have a dataframe and want to split the start_date column (a string containing a month and year) and keep just the year in a new column (start_year):

|      ID|    start_date|        End_date| start_year|
|01874938|   August 2013|   December 2014|       2013|
|00798252|    March 2009|        May 2015|       2009|
|02202785|  July 2, 2014|January 15, 2016|        2, |
|01646125| November 2012|   November 2015|       2012|

As you can see, I can split the date and keep the year. However, for dates like the one in row 3 ("July 2, 2014"), the result is "2," instead of 2014.

This is my code:

from pyspark.sql import functions as fn

split_col = fn.split(df7_ct_map['start_date'], ' ')
df = df7_ct_map.withColumn('NAME1', split_col.getItem(0))
df = df.withColumn('start_year', split_col.getItem(1))
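The failure mode can be reproduced outside Spark: splitting on a single space puts the year at a different index depending on the date format, so a fixed `getItem(1)` cannot work for both formats. A plain-Python sketch of the same whitespace split:

```python
# Splitting "July 2, 2014" on spaces: index 1 is the day ("2,"), not the year
tokens = "July 2, 2014".split(" ")
print(tokens)  # → ['July', '2,', '2014']

# For "August 2013", index 1 happens to be the year
print("August 2013".split(" "))  # → ['August', '2013']
```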

Upvotes: 2

Views: 2716

Answers (2)

philantrovert

Reputation: 10092

You could also extract the last 4 characters of your start_date column.

from pyspark.sql import functions as F

df.withColumn(
    'start_year',
    F.expr('substring(rtrim(start_date), -4, 4)')
).show()

+-------------+----------+
|   start_date|start_year|
+-------------+----------+
|  August 2013|      2013|
|   March 2009|      2009|
| July 2, 2014|      2014|
|November 2014|      2014|
+-------------+----------+

Upvotes: 2

Shaido

Reputation: 28367

You could use a regular expression instead of splitting on spaces.

from pyspark.sql.functions import regexp_extract

df.withColumn('start_year', regexp_extract(df['start_date'], r'\d{4}', 0))

This matches the first run of 4 consecutive digits, i.e. the year.
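The pattern can be sanity-checked with Python's built-in re module (Spark uses Java regexes, but `\d{4}` behaves the same in both):

```python
import re

# regexp_extract(col, r'\d{4}', 0) returns the first 4-digit run in the string
for s in ["August 2013", "March 2009", "July 2, 2014"]:
    print(re.search(r"\d{4}", s).group(0))
# → 2013, 2009, 2014
```

Note that `"2,"` in "July 2, 2014" is skipped because `\d{4}` requires four consecutive digits.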

Upvotes: 2
