leena

Reputation: 563

Extracting the year from a date in a PySpark dataframe

I have a PySpark dataframe that contains a date column "Reported Date" (type: string). I would like to get the count of another column after extracting the year from the date.

I can get the count if I use the string date column.

crimeFile_date.groupBy("Reported Date").sum("Offence Count").show()

and I get this output

+-------------+------------------+
|Reported Date|sum(Offence Count)|
+-------------+------------------+
|   13/08/2010|               342|
|    6/10/2011|               334|
|   27/11/2011|               269|
|   12/01/2012|               303|
|   22/02/2012|               286|
|   31/07/2012|               276|
|   25/04/2013|               222|
+-------------+------------------+

To extract the year from "Reported Date" I have converted it to a date format (using this approach) and named the column "Date". However, when I try to use the same code to group by the new column and do the count, I get an error message.

crimeFile_date.groupBy(year("Date").alias("year")).sum("Offence Count").show()
TypeError: strptime() argument 1 must be str, not None

This is the data schema:

root
 |-- Offence Count: integer (nullable = true)
 |-- Reported Date: string (nullable = true)
 |-- Date: date (nullable = true)

Is there a way to fix this error, or to extract the year using another method? Thank you.

Upvotes: 1

Views: 23715

Answers (1)

SMaZ

Reputation: 2655

If I understand correctly, you want to extract the year from a string date column. One way is to use a regex, but that can throw your logic off if the regex doesn't handle all scenarios.

Here is the date data type approach:

Imports

import pyspark.sql.functions as f

Creating your Dataframe

l1 = [('13/08/2010',342),('6/10/2011',334),('27/11/2011',269),('12/01/2012',303),('22/02/2012',286),('31/07/2012',276),('25/04/2013',222)]
dfl1 = spark.createDataFrame(l1).toDF("dates", "sum")

dfl1.show()
+----------+---+
|     dates|sum|
+----------+---+
|13/08/2010|342|
| 6/10/2011|334|
|27/11/2011|269|
|12/01/2012|303|
|22/02/2012|286|
|31/07/2012|276|
|25/04/2013|222|
+----------+---+

Now you can use the to_timestamp or to_date APIs from the functions package:

dfl2 = dfl1.withColumn('years',f.year(f.to_timestamp('dates', 'dd/MM/yyyy')))

dfl2.show()
+----------+---+-----+
|     dates|sum|years|
+----------+---+-----+
|13/08/2010|342| 2010|
| 6/10/2011|334| 2011|
|27/11/2011|269| 2011|
|12/01/2012|303| 2012|
|22/02/2012|286| 2012|
|31/07/2012|276| 2012|
|25/04/2013|222| 2013|
+----------+---+-----+

Now, group by years:

dfl2.groupBy('years').sum('sum').show()
+-----+--------+                                                                
|years|sum(sum)|
+-----+--------+
| 2013|     222|
| 2012|     865|
| 2010|     342|
| 2011|     603|
+-----+--------+

I've shown this in multiple steps for clarity, but you can combine the year extraction and the group-by in one step.

Happy to extend if you need some other help.

Upvotes: 4
