Naveen Reddy Marthala
Naveen Reddy Marthala

Reputation: 3123

pyspark - creating a column as quarter-year from a date column

I have a dataframe like this, and format of column date is yyyy-mm-dd :

+--------+----------+---------+----------+-----------+--------------------+
|order_id|product_id|seller_id|      date|pieces_sold|       bill_raw_text|
+--------+----------+---------+----------+-----------+--------------------+
|     668|    886059|     3205|2015-01-14|         91|pbdbzvpqzqvtzxone...|
|    6608|    541277|     1917|2012-09-02|         44|cjucgejlqnmfpfcmg...|
|   12962|    613131|     2407|2016-08-26|         90|cgqhggsjmrgkrfevc...|
|   14223|    774215|     1196|2010-03-04|         46|btujmkfntccaewurg...|
|   15131|    769255|     1546|2018-11-28|         13|mrfsamfuhpgyfjgki...|
+--------+----------+---------+----------+-----------+--------------------+

I would like to create and append a column that has a letter Q, quarter, an underscore, year from date column to this dataframe. Here's what I have tried:

from pyspark.sql import functions as F
sales_table.select(
    ("Q"+F.quarter('date')+"_"+F.date_format("date", "y")).alias('quarter_year')
).show(5)

and this is what I got:

+------------+
|quarter_year|
+------------+
|        null|
|        null|
|        null|
|        null|
|        null|
+------------+

My expected output:

+---------------+
|   quarter_year|
+---------------+
|        Q1_2015|
|        Q3_2012|
|        Q3_2016|
|        Q1_2010|
|        Q4_2018|
+---------------+

I am surprised by the absence of error messages and column with nulls. How do I create this?

Upvotes: 1

Views: 2063

Answers (2)

bumblebear
bumblebear

Reputation: 95

You just need to expand your use of date_format() to apply the Quarterly formats in the Datetime patterns listed here.

from pyspark.sql.functions import col, date_format

sales_table = sales_table.withColumn("quarter_year", date_format(col("date"), "QQQ_yyyy"))

which gives..

+----------+------------+
|      date|quarter_year|
+----------+------------+
|2021-10-18|     Q4_2021|
|2021-09-25|     Q3_2021|
|2021-09-01|     Q3_2021|
|2021-09-26|     Q3_2021|
|2021-10-16|     Q4_2021|

Upvotes: 1

mck
mck

Reputation: 42352

You can't + string columns. You need to use concat. Also you need to use F.lit for literal columns of user-specified strings.

And I think there is a mistake - should be F.year('date')? not F.quarter('year')?

sales_table.select(
    F.concat(F.lit("Q"), F.quarter('date'), F.lit("_"), F.year('date')).alias('quarter_year')
).show(5)

Upvotes: 2

Related Questions