Reputation: 145

Difference of elements in PySpark list with one or two elements

I have a PySpark dataframe which has a list with either one element or two elements. When there are two elements in the list, they are not ordered by ascending or descending orders.

+--------+----------+-------+
| version| timestamp| list  |
+--------+-----+----|-------+
| v1     |2012-01-10| [5,2] |
| v1     |2012-01-11| [2,5] |
| v1     |2012-01-12| [3,2] |
| v2     |2012-01-12| [2]   |
| v2     |2012-01-11| [1,2] |
| v2     |2012-01-13| [1]   |
+--------+----------+-------+

I want to take difference betweeen the first and the second elements of the list (when there are two elements) and have that as another column (diff). When there is only one element in the list, I want to put zero in output. Here is an example of the output that I want.

+--------+----------+-------+-------+
| version| timestamp| list  |  diff | 
+--------+-----+----|-------+-------+
| v1     |2012-01-10| [5,2] |   3   |
| v1     |2012-01-11| [2,5] |  -3   |
| v1     |2012-01-12| [3,2] |   1   |
| v2     |2012-01-12| [2]   |   0   |
| v2     |2012-01-11| [1,2] |  -1   |
| v2     |2012-01-13| [1]   |   0   |
+--------+----------+-------+-------+

My question is similar to this question which I asked before, but is not exactly the same.

How can I do this using PySpark?

I am also open to the use of UDFs to get my intended output in case one needs that.

Approaches without UDF and those which are based on UDF are both welcome. Thanks.

Upvotes: 1

Answers (2)

CPak

Reputation: 13581

You could also define a udf like so

Example data

data = [
    ('v1', [5, 2],),
    ('v1', [2, 5],),
    ('v1', [3, 2],),
    ('v2', [2],),
    ('v2', [1, 2],),
    ('v2', [1],),
]
df = spark.createDataFrame(data, ['version', 'list'])

Solution

from functools import reduce
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# UDF definition
find_diff = udf(lambda a: reduce(lambda x, y: x - y, a), IntegerType()) 

(
    df.
        withColumn(
            'diff',
            find_diff('list')
        ).
        show(truncate=False)
)

+-------+------+----+                                                           
|version|list  |diff|
+-------+------+----+
|v1     |[5, 2]|3   |
|v1     |[2, 5]|-3  |
|v1     |[3, 2]|1   |
|v2     |[2]   |2   |
|v2     |[1, 2]|-1  |
|v2     |[1]   |1   |
+-------+------+----+

Upvotes: 0

murtihash

Reputation: 8410

Adding to @Shu's answer from earlier, just add a when/otherwise clause to that by checking size of array.

df.withColumn("diff", F.when(F.size('list')==2, F.expr("""transform(array(list),x-> x[0]-x[1])""")[0])\
                       .otherwise(F.lit(0))).show()

#+------+----+
#|  list|diff|
#+------+----+
#|[5, 2]|   3|
#|[2, 5]|  -3|
#|   [2]|   0|
#+------+----+

Upvotes: 2

Difference of elements in PySpark list with one or two elements

Answers (2)

Example data

Solution

Related Questions