Reputation: 563
I have a PySpark DataFrame that contains a date column "Reported Date" (type: string). I would like to sum another column after extracting the year from the date.
I can get the sum if I group by the string date column.
crimeFile_date.groupBy("Reported Date").sum("Offence Count").show()
and I get this output
+-------------+------------------+
|Reported Date|sum(Offence Count)|
+-------------+------------------+
|   13/08/2010|               342|
|    6/10/2011|               334|
|   27/11/2011|               269|
|   12/01/2012|               303|
|   22/02/2012|               286|
|   31/07/2012|               276|
|   25/04/2013|               222|
+-------------+------------------+
To extract the year from "Reported Date" I have converted it to a date format (using this approach) and named the column "Date". However, when I try to use the same code to group by the new column and do the count I get an error message.
crimeFile_date.groupBy(year("Date").alias("year")).sum("Offence Count").show()
TypeError: strptime() argument 1 must be str, not None
This is the data schema:
root
|-- Offence Count: integer (nullable = true)
|-- Reported Date: string (nullable = true)
|-- Date: date (nullable = true)
Is there a way to fix this error, or to extract the year using another method? Thank you.
Upvotes: 1
Views: 23715
Reputation: 2655
If I understand correctly, you want to extract the year from a string date column. One way is to use a regex, but that can give wrong results if the pattern does not cover every format that appears in the data.
Here is an approach that uses the date data type instead.
Imports
import pyspark.sql.functions as f
Creating your Dataframe
l1 = [('13/08/2010',342),('6/10/2011',334),('27/11/2011',269),('12/01/2012',303),('22/02/2012',286),('31/07/2012',276),('25/04/2013',222)]
dfl1 = spark.createDataFrame(l1).toDF("dates","sum")
dfl1.show()
+----------+---+
|     dates|sum|
+----------+---+
|13/08/2010|342|
| 6/10/2011|334|
|27/11/2011|269|
|12/01/2012|303|
|22/02/2012|286|
|31/07/2012|276|
|25/04/2013|222|
+----------+---+
Now you can use the to_timestamp or to_date functions from pyspark.sql.functions to parse the string, then apply year to the result.
dfl2 = dfl1.withColumn('years',f.year(f.to_timestamp('dates', 'dd/MM/yyyy')))
dfl2.show()
+----------+---+-----+
|     dates|sum|years|
+----------+---+-----+
|13/08/2010|342| 2010|
| 6/10/2011|334| 2011|
|27/11/2011|269| 2011|
|12/01/2012|303| 2012|
|22/02/2012|286| 2012|
|31/07/2012|276| 2012|
|25/04/2013|222| 2013|
+----------+---+-----+
Now group by the years column.
dfl2.groupBy('years').sum('sum').show()
+-----+--------+
|years|sum(sum)|
+-----+--------+
| 2013|     222|
| 2012|     865|
| 2010|     342|
| 2011|     603|
+-----+--------+
I have shown this in multiple steps for clarity, but you can combine the year extraction and the group by in a single step.
Happy to extend if you need some other help.
Upvotes: 4