Reputation: 145
I have a PySpark dataframe which has a list with either one element or two elements. When there are two elements in the list, they are not ordered by ascending or descending orders.
+--------+----------+-------+
| version| timestamp| list |
+--------+-----+----|-------+
| v1 |2012-01-10| [5,2] |
| v1 |2012-01-11| [2,5] |
| v1 |2012-01-12| [3,2] |
| v2 |2012-01-12| [2] |
| v2 |2012-01-11| [1,2] |
| v2 |2012-01-13| [1] |
+--------+----------+-------+
I want to take difference betweeen the first and the second elements of the list (when there are two elements) and have that as another column (diff
). When there is only one element in the list, I want to put zero in output. Here is an example of the output that I want.
+--------+----------+-------+-------+
| version| timestamp| list | diff |
+--------+-----+----|-------+-------+
| v1 |2012-01-10| [5,2] | 3 |
| v1 |2012-01-11| [2,5] | -3 |
| v1 |2012-01-12| [3,2] | 1 |
| v2 |2012-01-12| [2] | 0 |
| v2 |2012-01-11| [1,2] | -1 |
| v2 |2012-01-13| [1] | 0 |
+--------+----------+-------+-------+
My question is similar to this question which I asked before, but is not exactly the same.
How can I do this using PySpark?
I am also open to the use of UDFs to get my intended output in case one needs that.
Approaches without UDF and those which are based on UDF are both welcome. Thanks.
Upvotes: 1
Views: 328
Reputation: 13581
You could also define a udf
like so
data = [
('v1', [5, 2],),
('v1', [2, 5],),
('v1', [3, 2],),
('v2', [2],),
('v2', [1, 2],),
('v2', [1],),
]
df = spark.createDataFrame(data, ['version', 'list'])
from functools import reduce
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
# UDF definition
find_diff = udf(lambda a: reduce(lambda x, y: x - y, a), IntegerType())
(
df.
withColumn(
'diff',
find_diff('list')
).
show(truncate=False)
)
+-------+------+----+
|version|list |diff|
+-------+------+----+
|v1 |[5, 2]|3 |
|v1 |[2, 5]|-3 |
|v1 |[3, 2]|1 |
|v2 |[2] |2 |
|v2 |[1, 2]|-1 |
|v2 |[1] |1 |
+-------+------+----+
Upvotes: 0
Reputation: 8410
Adding to @Shu's
answer from earlier, just add a when/otherwise
clause to that by checking size
of array.
df.withColumn("diff", F.when(F.size('list')==2, F.expr("""transform(array(list),x-> x[0]-x[1])""")[0])\
.otherwise(F.lit(0))).show()
#+------+----+
#| list|diff|
#+------+----+
#|[5, 2]| 3|
#|[2, 5]| -3|
#| [2]| 0|
#+------+----+
Upvotes: 2