Naveen Reddy Marthala

Reputation: 3123

pyspark - can't get quarter and week of year from date column

I have a pyspark dataframe that looks like this:

+--------+----------+---------+----------+-----------+--------------------+
|order_id|product_id|seller_id|      date|pieces_sold|       bill_raw_text|
+--------+----------+---------+----------+-----------+--------------------+
|     668|    886059|     3205|2015-01-14|         91|pbdbzvpqzqvtzxone...|
|    6608|    541277|     1917|2012-09-02|         44|cjucgejlqnmfpfcmg...|
|   12962|    613131|     2407|2016-08-26|         90|cgqhggsjmrgkrfevc...|
|   14223|    774215|     1196|2010-03-04|         46|btujmkfntccaewurg...|
|   15131|    769255|     1546|2018-11-28|         13|mrfsamfuhpgyfjgki...|
|   15625|     86357|     2455|2008-04-18|         50|wlwsliatrrywqjrih...|
|   18470|     26238|      295|2009-03-06|         86|zrfdpymzkgbgdwFwz...|
|   29883|    995036|     4596|2009-10-25|         86|oxcutwmqgmioaelsj...|
|   38428|    193694|     3826|2014-01-26|         82|yonksvwhrfqkytypr...|
|   41023|    949332|     4158|2014-09-03|         83|hubxhfdtxrqsfotdq...|
+--------+----------+---------+----------+-----------+--------------------+

I would like to create two columns, one with the quarter of the year and another with the week number of the year. Here's what I did, referring to the documentation for weekofyear and quarter:

from pyspark.sql import functions as F
sales_table = sales_table.withColumn("week_year", F.date_format(F.to_date("date", "yyyy-mm-dd"),
                                                                F.weekofyear("d")))
sales_table = sales_table.withColumn("quarter", F.date_format(F.to_date("date", "yyyy-mm-dd"),
                                                              F.quarter("d")))
sales_table.show(10)

And here's the error:

Column is not iterable
Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/functions.py", line 945, in date_format
    return Column(sc._jvm.functions.date_format(_to_java_column(date), format))
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1296, in __call__
    args_command, temp_args = self._build_args(*args)
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1260, in _build_args
    (new_args, temp_args) = self._get_args(args)
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1247, in _get_args
    temp_arg = converter.convert(arg, self.gateway_client)
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 510, in convert
    for element in object:
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/column.py", line 353, in __iter__
    raise TypeError("Column is not iterable")
TypeError: Column is not iterable

How do I create and append those two columns?

And is there a better or more efficient way to create these columns, without having to convert the date column to yyyy-mm-dd format every time, ideally creating both columns in one command?

Upvotes: 1

Views: 6993

Answers (2)

mck

Reputation: 42352

You can just use the functions directly on the string column date; Spark implicitly casts yyyy-MM-dd strings to dates.

from pyspark.sql import functions as F

df = df.select(
    '*',
    F.weekofyear('date').alias('week_year'), 
    F.quarter('date').alias('quarter')
)
df.show()

+--------+----------+---------+----------+-----------+--------------------+---------+-------+
|order_id|product_id|seller_id|      date|pieces_sold|       bill_raw_text|week_year|quarter|
+--------+----------+---------+----------+-----------+--------------------+---------+-------+
|     668|    886059|     3205|2015-01-14|         91|pbdbzvpqzqvtzxone...|        3|      1|
|    6608|    541277|     1917|2012-09-02|         44|cjucgejlqnmfpfcmg...|       35|      3|
|   12962|    613131|     2407|2016-08-26|         90|cgqhggsjmrgkrfevc...|       34|      3|
|   14223|    774215|     1196|2010-03-04|         46|btujmkfntccaewurg...|        9|      1|
|   15131|    769255|     1546|2018-11-28|         13|mrfsamfuhpgyfjgki...|       48|      4|
|   15625|     86357|     2455|2008-04-18|         50|wlwsliatrrywqjrih...|       16|      2|
|   18470|     26238|      295|2009-03-06|         86|zrfdpymzkgbgdwFwz...|       10|      1|
|   29883|    995036|     4596|2009-10-25|         86|oxcutwmqgmioaelsj...|       43|      4|
|   38428|    193694|     3826|2014-01-26|         82|yonksvwhrfqkytypr...|        4|      1|
|   41023|    949332|     4158|2014-09-03|         83|hubxhfdtxrqsfotdq...|       36|      3|
+--------+----------+---------+----------+-----------+--------------------+---------+-------+
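Note that this computes both columns in a single select, so no intermediate conversion of the date column is needed. Explicit parsing only becomes necessary if the strings are in a non-ISO format; here is a minimal sketch, assuming a hypothetical dd/MM/yyyy input format:

from pyspark.sql import functions as F

# Hypothetical: dates stored as strings like "14/01/2015" must be parsed
# with to_date before weekofyear/quarter can be applied.
df = df.withColumn('date', F.to_date('date', 'dd/MM/yyyy'))
df = df.select(
    '*',
    F.weekofyear('date').alias('week_year'),
    F.quarter('date').alias('quarter')
)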

Upvotes: 6

notNull

Reputation: 31490

You don't have to use the date_format function here. Since the date is already in yyyy-MM-dd format, use weekofyear and quarter directly on the date column.

Example:

df.show()
#+----------+
#|      date|
#+----------+
#|2015-01-14|
#+----------+
from pyspark.sql import functions as F

df.withColumn("week_year", F.weekofyear(F.col("date"))).\
withColumn("quarter", F.quarter(F.col("date"))).\
show()
#+----------+---------+-------+
#|      date|week_year|quarter|
#+----------+---------+-------+
#|2015-01-14|        3|      1|
#+----------+---------+-------+
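For reference, the original attempt failed because date_format expects a literal pattern string as its second argument, not a Column; passing F.weekofyear(...) there is what raised TypeError: Column is not iterable. date_format is only needed when you want a formatted string out. A minimal sketch (the year_month column name is just illustrative):

from pyspark.sql import functions as F

# date_format takes a Column and a literal pattern string,
# returning the date rendered as a string, e.g. "2015-01".
df.withColumn("year_month", F.date_format(F.col("date"), "yyyy-MM")).show()
#+----------+----------+
#|      date|year_month|
#+----------+----------+
#|2015-01-14|   2015-01|
#+----------+----------+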

Upvotes: 2
