user17271845

Reputation: 15

Pyspark: Get index of array element based on substring

I have the following dataframe, which contains a column of arrays (col1). I need to get the index of the element that contains a certain substring ("58=").

+-----------------------------------------------------------+-----+
|                                                      col1 |a_pos|
+-----------------------------------------------------------+-----+
|[8=FIX.4.4, 55=ITUBD264, 58=AID[43e39b2e-c6e2-4947]        |    0|
+-----------------------------------------------------------+-----+

I've tried to use array_position(col1, "58="), but it seems it only works with exact matches, not substrings.
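
For context, the attempt was presumably along these lines (a minimal sketch; the df name and column come from the example above). array_position compares the value against whole elements, so "58=" never equals "58=AID[…" and the result is 0:

from pyspark.sql import functions as F

# Exact-match lookup: '58=' is compared against whole array elements,
# so it never matches '58=AID[43e39b2e-c6e2-4947' and returns 0.
df = df.withColumn('idx', F.array_position('col1', '58='))
df.show(truncate=False)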

In Python I'm doing exactly this, but with pandas, using the following code:

df['idx'] = [max(range(len(l)), key=lambda x: '58=' in l[x]) for l in df['col1']]

Upvotes: 0

Views: 673

Answers (1)

wwnde

Reputation: 26676

Check whether each element contains 58= using the rlike function inside a higher-order transform, then determine the position of the first match using array_position. Code below:

from pyspark.sql.functions import expr
df.withColumn('index', expr("array_position(transform(col1, x -> rlike(x, '58=')), true)")).show(truncate=False)
+---------------------------------------------------+-----+-----+
|col1                                               |a_pos|index|
+---------------------------------------------------+-----+-----+
|[8=FIX.4.4, 55=ITUBD264, 58=AID[43e39b2e-c6e2-4947]|0    |3    |
+---------------------------------------------------+-----+-----+
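
On Spark 3.1+ the same approach can also be written with the Python API instead of a SQL expression string; a sketch under that assumption:

from pyspark.sql import functions as F

# transform marks each element that contains the substring '58=';
# array_position returns the 1-based index of the first True (0 if none).
df = df.withColumn(
    'index',
    F.array_position(F.transform('col1', lambda x: x.rlike('58=')), True)
)
df.show(truncate=False)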

Upvotes: 1
