Reputation: 73
I have a dataframe with some columns:
+------------+--------+----------+----------+
|country_name| ID_user|birth_date| psdt|
+------------+--------+----------+----------+
| Россия|16460783| 486|1970-01-01|
| Россия|16467391| 4669|1970-01-01|
| Россия|16467889| 6861|1970-01-01|
| Казахстан|16468013| 5360|1970-01-01|
| Россия|16471027| 6311|1970-01-01|
| Россия|16474162| 5567|1970-01-01|
| Россия|16476386| 4351|1970-01-01|
| Россия|16481067| 3831|1970-01-01|
| Казахстан|16485965| -2369|1970-01-01|
| Германия|16486027| 5864|1970-01-01|
+------------+--------+----------+----------+
only showing top 10 rows
I need to add "psdt" with "birth_date". I wrote this code, but (sf.date_add) doesn't work:
resultbirthDF =(
resultDF
.select(sf.col("country_name"),
sf.col("ID_user"),
sf.col("birth_date"),
sf.lit(past_datetr).alias("psdt")
)
.withColumn("birth_datetrue",sf.date_add(sf.to_date(sf.col("psdt")),sf.col("birth_date")))
).show(10)
'Column' object is not callable
Traceback (most recent call last):
File "/volumes/disk1/yarn/local/usercache/livy/appcache/application_1573843665329_0786/container_e05_1573843665329_0786_01_000001/pyspark.zip/pyspark/sql/functions.py", line 1006, in date_add
return Column(sc._jvm.functions.date_add(_to_java_column(start), days))
How to solve this problem?
Upvotes: 7
Views: 12164
Reputation: 1954
Starting from Spark Python API 3.3.0 you can use date_add with Column types as the second parameter: https://spark.apache.org/docs/3.3.0/api/python/_modules/pyspark/sql/functions.html#date_add
So your original code in the question with not raise any error.
Upvotes: 0
Reputation: 7419
From Pyspark Documentation
pyspark.sql.functions.date_add(start, days)
Returns the date that is days days after start >>> df = spark.createDataFrame([('2015-04-08',)], ['dt']) >>> df.select(date_add(df.dt, 1).alias('next_date')).collect() [Row(next_date=datetime.date(2015, 4, 9))]
Try changing your code to sf.date_add(sf.to_date(sf.col("psdt")), 10)
and see if 10 days get added. date_add
expects the first argument to be a column and the second argument to be an integer (for the number of days you want to add to the column).
You can do exactly what you want to do without a UDF, but using a SQL expression as follows:
df.withColumn("birth_datetrue", expr("date_add(psdt, birth_date)"))
Sample output:
df.withColumn("new_dt", F.expr("date_add(dt, add)")).show()
+----------+---+----------+
| dt|add| new_dt|
+----------+---+----------+
|2015-04-08| 1|2015-04-09|
|2015-04-09| 2|2015-04-11|
|2015-04-10| 3|2015-04-13|
+----------+---+----------+
Upvotes: 15