Reputation: 883
I have a dataframe and want to split the start_date
column (string and year) and keep just the year in a new column (column 4):
ID start_date End_date start_year
|01874938| August 2013| December 2014| 2013|
|00798252| March 2009| May 2015| 2009|
|02202785| July 2, 2014|January 15, 2016| 2, |
|01646125| November 2012| November 2015| 2012|
As you can see I can split the date and keep the years. However for dates like in row 3: "July 2, 2014" the result is "2, " instead of 2014.
This is my code :
split_col = fn.split(df7_ct_map['start_date'] , ' ')
df = df7_ct_map.withColumn('NAME1', split_col.getItem(0))
df = dff.withColumn('start_year', split_col.getItem(1))
Upvotes: 2
Views: 2716
Reputation: 10092
You could also extract the last 4 characters of your column start_date
.
from pyspark.sql import functions as F
df.withColumn('start_year' ,
F.expr('substring(rtrim(start_date), length(start_date) - 4,length(start_date) )' ) )
.show()
+-------------+----------+
| start_date|start_year|
+-------------+----------+
| August 2013| 2013|
| March 2009| 2009|
| July 2, 2014| 2014|
|November 2014| 2014|
+-------------+----------+
Upvotes: 2
Reputation: 28367
You could use a regular expression instead of splitting on ,
.
df.withColumn('start_year', regexp_extract(df['start_date'], '\\d{4}', 0))
This will match 4 consecutive numbers, i.e. a year.
Upvotes: 2