AnonX

Reputation: 261

Get index of column item that is in an array in another column in a Spark dataframe

I have a data frame that looks like this:

+-------+-------+--------------------+
|   user|   item|        ls_rec_items|
+-------+-------+--------------------+
|    321|      3|  [4, 3, 2, 6, 1, 5]|
|    123|      2|  [5, 6, 3, 1, 2, 4]|
|    123|      7|  [5, 6, 3, 1, 2, 4]|
+-------+-------+--------------------+

I want to find the position of each row's "item" value within that row's "ls_rec_items" array.
I know about the array_position function, but I don't know how to pass the "item" column as the value to search for.

This works with a literal value:
df.select(F.array_position(df.ls_rec_items, 3)).collect()
But I want to pass the column instead:
df.select(F.array_position(df.ls_rec_items, df.item)).collect()

The output should look like this:

+-------+-------+--------------------+-----+
|   user|   item|        ls_rec_items|  pos|
+-------+-------+--------------------+-----+
|    321|      3|  [4, 3, 2, 6, 1, 5]|    2|
|    123|      2|  [5, 6, 3, 1, 2, 4]|    5|
|    123|      7|  [5, 6, 3, 1, 2, 4]|    0|
+-------+-------+--------------------+-----+

Upvotes: 3

Views: 3000

Answers (1)

vladsiv

Reputation: 2936

You can use F.expr with array_position. Inside a SQL expression, both arguments can be column references:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    data = [
        {"user": 321, "item": 3, "ls_rec_items": [4, 3, 2, 6, 1, 5]},
        {"user": 123, "item": 2, "ls_rec_items": [5, 6, 3, 1, 2, 4]},
        {"user": 123, "item": 7, "ls_rec_items": [5, 6, 3, 1, 2, 4]},
    ]
    df = spark.createDataFrame(data)
    # Both arguments are column names here, so the value to look up
    # is read from "item" on each row.
    df = df.withColumn("pos", F.expr("array_position(ls_rec_items, item)"))
    df.show()

Result

+----+------------------+----+---+
|item|      ls_rec_items|user|pos|
+----+------------------+----+---+
|   3|[4, 3, 2, 6, 1, 5]| 321|  2|
|   2|[5, 6, 3, 1, 2, 4]| 123|  5|
|   7|[5, 6, 3, 1, 2, 4]| 123|  0|
+----+------------------+----+---+
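
To make the pos values easier to read: array_position counts from 1, not 0, and returns 0 when the value is not in the array. The plain-Python sketch below mimics that rule on the three example rows. The helper name array_position_py is hypothetical and not part of PySpark, and null handling is left out.

```python
def array_position_py(values, target):
    """Hypothetical plain-Python mirror of array_position for non-null inputs."""
    # Positions are counted from 1; 0 signals that the value is absent.
    for index, value in enumerate(values, start=1):
        if value == target:
            return index
    return 0


# (user, item, ls_rec_items) tuples taken from the question's data frame.
rows = [
    (321, 3, [4, 3, 2, 6, 1, 5]),
    (123, 2, [5, 6, 3, 1, 2, 4]),
    (123, 7, [5, 6, 3, 1, 2, 4]),
]
positions = [array_position_py(ls_rec_items, item) for _, item, ls_rec_items in rows]
print(positions)  # [2, 5, 0]
```

If you need a 0-based index downstream, subtract 1 and treat -1 as "not found".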

Upvotes: 4
