I have a Spark (Python) DataFrame with two columns: a user ID and an array of arrays, which Spark represents as wrapped arrays, like so:
[WrappedArray(9, 10, 11, 12), WrappedArray(20, 21, 22, 23, 24, 25, 26)]
In the usual list notation, this would look like:
[[9, 10, 11, 12], [20, 21, 22, 23, 24, 25, 26]]
I want to perform operations on each of the subarrays, for example, take a third list and check whether any of its values appears in the first sub-array, but I can't seem to find solutions for PySpark 2.0 (only older, Scala-specific solutions like this and this).
How does one access (and in general work with) wrapped arrays? What is an efficient way to do what I described above?
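For reference, a minimal way to reproduce the DataFrame I am describing (assuming an active SparkSession named spark); once a row is pulled back to the driver, the wrapped array shows up as a plain Python list:

from pyspark.sql import Row

# Hypothetical one-row reproduction of the DataFrame described above
df = spark.createDataFrame(
    [Row(userID=1, arrays=[[9, 10, 11, 12], [20, 21, 22, 23, 24, 25, 26]])]
)

# On the Python side, the wrapped array is just a list of lists
row = df.first()
print(row["arrays"][0])   # [9, 10, 11, 12]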
You can treat each wrapped array as an ordinary Python list. In your example, if you want to know which elements of the second wrapped array are present in the first array, you could do something like this:
# Prepare sample data: userID, array1, array2
data = [
    [10001, [9, 10, 11, 12], [20, 10, 9, 23, 24, 25, 26]],
    [10002, [8, 1, 2, 3], [49, 3, 6, 5, 6]],
]
# `sc` is the SparkContext available in the PySpark shell
rdd = sc.parallelize(data)

# Append a fourth column: the elements of array2 that also appear in array1
df = rdd.map(
    lambda row: row + [[x for x in row[2] if x in row[1]]]
).toDF(["userID", "array1", "array2", "commonElements"])
df.show()
Output:
+------+---------------+--------------------+--------------+
|userID| array1| array2|commonElements|
+------+---------------+--------------------+--------------+
| 10001|[9, 10, 11, 12]|[20, 10, 9, 23, 2...| [10, 9]|
| 10002| [8, 1, 2, 3]| [49, 3, 6, 5, 6]| [3]|
+------+---------------+--------------------+--------------+
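If you would rather stay in the DataFrame API instead of going through the RDD, a Python UDF works the same way, because each wrapped array arrives inside the UDF as a plain Python list. A minimal sketch, assuming the df built above (whose integer elements are inferred as longs):

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, LongType

# The UDF receives each wrapped array as an ordinary Python list;
# LongType matches the schema inferred for the sample data above
common = udf(lambda a1, a2: [x for x in a2 if x in a1],
             ArrayType(LongType()))

df.withColumn("commonElements", common(df["array1"], df["array2"])).show()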