I have a Spark (Python) DataFrame with two columns: a user ID and an array of arrays, which Spark represents as wrapped arrays, like so:
[WrappedArray(9, 10, 11, 12), WrappedArray(20, 21, 22, 23, 24, 25, 26)]
In the usual list notation, this would look like:
[[9, 10, 11, 12], [20, 21, 22, 23, 24, 25, 26]]
I want to perform operations on each of the subarrays, for example, take a third list and check whether any of its values appears in the first sub-array, but I can't seem to find solutions for PySpark 2.0 (only older, Scala-specific solutions like this and this).
How does one access (and in general work with) wrapped arrays? What is an efficient way to do what I described above?
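For reference, a minimal way to reproduce the DataFrame I am describing (assuming an active SparkSession named spark); once a row is pulled back to the driver, the wrapped array shows up as a plain Python list:

from pyspark.sql import Row

# Hypothetical one-row reproduction of the DataFrame described above
df = spark.createDataFrame(
    [Row(userID=1, arrays=[[9, 10, 11, 12], [20, 21, 22, 23, 24, 25, 26]])]
)

# On the Python side, the wrapped array is just a list of lists
row = df.first()
print(row["arrays"][0])   # [9, 10, 11, 12]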
You can treat each wrapped array as an ordinary Python list. In your example, if you want to know which elements of the second wrapped array are present in the first array, you could do something like this:
# Prepare sample data: userID, array1, array2
data = [
    [10001, [9, 10, 11, 12], [20, 10, 9, 23, 24, 25, 26]],
    [10002, [8, 1, 2, 3], [49, 3, 6, 5, 6]],
]
# `sc` is the SparkContext available in the PySpark shell
rdd = sc.parallelize(data)

# Append a fourth column: the elements of array2 that also appear in array1
df = rdd.map(
    lambda row: row + [[x for x in row[2] if x in row[1]]]
).toDF(["userID", "array1", "array2", "commonElements"])
df.show()
Output:
+------+---------------+--------------------+--------------+
|userID| array1| array2|commonElements|
+------+---------------+--------------------+--------------+
| 10001|[9, 10, 11, 12]|[20, 10, 9, 23, 2...| [10, 9]|
| 10002| [8, 1, 2, 3]| [49, 3, 6, 5, 6]| [3]|
+------+---------------+--------------------+--------------+
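If you would rather stay in the DataFrame API instead of going through the RDD, a Python UDF works the same way, because each wrapped array arrives inside the UDF as a plain Python list. A minimal sketch, assuming the df built above (whose integer elements are inferred as longs):

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, LongType

# The UDF receives each wrapped array as an ordinary Python list;
# LongType matches the schema inferred for the sample data above
common = udf(lambda a1, a2: [x for x in a2 if x in a1],
             ArrayType(LongType()))

df.withColumn("commonElements", common(df["array1"], df["array2"])).show()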