user10853036

Reputation: 145

Difference of elements in list in PySpark

I have a PySpark dataframe (df) with a column that contains two-element lists. The two elements in each list are not sorted in ascending or descending order.

+--------+----------+-------+
| version| timestamp| list  |
+--------+----------+-------+
| v1     |2012-01-10| [5,2] |
| v1     |2012-01-11| [2,5] |
| v1     |2012-01-12| [3,2] |
| v2     |2012-01-12| [2,3] |
| v2     |2012-01-11| [1,2] |
| v2     |2012-01-13| [2,1] |
+--------+----------+-------+

I want to take the difference between the first and second elements of each list and store it in another column (diff). Here is an example of the output that I want.

+--------+----------+-------+-------+
| version| timestamp| list  |  diff |
+--------+----------+-------+-------+
| v1     |2012-01-10| [5,2] |   3   |
| v1     |2012-01-11| [2,5] |  -3   |
| v1     |2012-01-12| [3,2] |   1   |
| v2     |2012-01-12| [2,3] |  -1   |
| v2     |2012-01-11| [1,2] |  -1   |
| v2     |2012-01-13| [2,1] |   1   |
+--------+----------+-------+-------+

How can I do this using PySpark?

I tried the following:

transform_expr = (
        "transform(diff, x-y ->"
        + "x as list[0], y as list[1])"
    )

df = df.withColumn("diff", F.expr(transform_expr)) 

But the above approach did not give me any output.

Both UDF-based approaches and approaches without UDFs are welcome. Thanks.

Upvotes: 1

Views: 1113

Answers (1)

notNull

Reputation: 31460

There are multiple ways to do this. You can use element_at or transform (both need Spark 2.4 or newer), array indexing ([0]), or .getItem() to get the difference.

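from pyspark.sql.functions import col, concat_ws, element_at, expr

# sample dataframe (assumes an active SparkSession named spark, as in the pyspark shell)
df = spark.createDataFrame([([5, 2],), ([2, 5],)], ["list"])

# using element_at (indexes are 1-based)
df.withColumn("diff", element_at(col("list"), 1) - element_at(col("list"), 2)).show()

# using transform; concat_ws turns the one-element result array into a string
df.withColumn("diff", concat_ws("", expr("""transform(array(list), x -> x[0] - x[1])"""))).show()

# using array index (0-based)
df.withColumn("diff", col("list")[0] - col("list")[1]).show()

# using .getItem (0-based)
df.withColumn("diff", col("list").getItem(0) - col("list").getItem(1)).show()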

#+------+----+
#|  list|diff|
#+------+----+
#|[5, 2]|   3|
#|[2, 5]|  -3|
#+------+----+
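
Since the question also asks about UDFs, here is a minimal sketch of a UDF-based variant. The names list_diff and add_diff_column are hypothetical helpers made up for this illustration, and returning null for a missing or too-short list is an assumption about the desired behaviour. The built-in column functions above are usually preferable, because a Python UDF has to move every row between the JVM and Python.

```python
def list_diff(pair):
    """Return pair[0] - pair[1], or None when the list is missing or incomplete.

    Hypothetical helper: plain Python, so it can be checked without Spark.
    """
    if pair is None or len(pair) < 2:
        return None
    if pair[0] is None or pair[1] is None:
        return None
    return pair[0] - pair[1]


def add_diff_column(df, list_col="list", out_col="diff"):
    """Append out_col = first element minus second element of list_col via a UDF.

    Hypothetical helper; the column names are parameters, not part of any API.
    """
    # Imported here so that list_diff stays usable without a Spark installation.
    from pyspark.sql import functions as F
    from pyspark.sql.types import LongType

    # LongType matches the bigint elements Spark infers from Python ints.
    diff_udf = F.udf(list_diff, LongType())
    return df.withColumn(out_col, diff_udf(F.col(list_col)))
```

With the sample dataframe above, add_diff_column(df).show() should give the same diff values as the other approaches.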

Upvotes: 5
