Reputation: 41
I have a PySpark dataframe with the below schema:
root
|-- channelGrouping: string (nullable = false)
|-- clientId: string (nullable = false)
|-- customDimensions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- index: long (nullable = true)
| | |-- value: string (nullable = true)
I want to fetch all records for which the value at index 2 of customDimensions matches the regular expression r'b'.
Sample input:
+--------+---------------+--------------------------+
|clientId|channelGrouping| customDimensions |
+--------+---------------+--------------------------+
| 123    | alpha         | [[1, a], [2, b]]         |
| 456    | beta          | [[1, a], [2, b], [3, c]] |
| 789    | gama          | [[1, a], [2, a], [3, c]] |
+--------+---------------+--------------------------+
Expected output:
+--------+---------------+--------------------------+
|clientId|channelGrouping| customDimensions |
+--------+---------------+--------------------------+
| 123    | alpha         | [[1, a], [2, b]]         |
| 456    | beta          | [[1, a], [2, b], [3, c]] |
+--------+---------------+--------------------------+
In BigQuery this can be achieved using the below query:
select
*
from
some_table
where
REGEXP_CONTAINS((SELECT value FROM UNNEST(customDimensions) WHERE index=2), r'b')
Upvotes: 2
Views: 809
Reputation: 1117
Here is an alternative way that uses a SQL expression string directly within where:
df1 = df.where("element_at(map_from_entries(customDimensions), 2) rlike 'b'")
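For context, here is a minimal, self-contained sketch that rebuilds the sample data from the question and applies this expression; the session setup and the DDL schema string are assumptions for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample rows from the question, with an assumed DDL schema for the array of structs
data = [
    ("123", "alpha", [(1, "a"), (2, "b")]),
    ("456", "beta",  [(1, "a"), (2, "b"), (3, "c")]),
    ("789", "gama",  [(1, "a"), (2, "a"), (3, "c")]),
]
schema = ("clientId string, channelGrouping string, "
          "customDimensions array<struct<index: bigint, value: string>>")
df = spark.createDataFrame(data, schema)

# map_from_entries turns the (index, value) structs into a map keyed by index,
# element_at(..., 2) looks up the value for index 2, and rlike applies the regex
df1 = df.where("element_at(map_from_entries(customDimensions), 2) rlike 'b'")
df1.show(truncate=False)  # keeps clientId 123 and 456, drops 789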
Upvotes: 0
Reputation: 41
Managed to write a solution:
from pyspark.sql.functions import col, element_at, map_from_entries

df = spark.read.json("some_path")

# turn the (index, value) structs into a map and filter on the value stored at key 2
df1 = df.where(element_at(map_from_entries(col("customDimensions")), 2).rlike("b"))
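As a side note (not part of the original answer), an equivalent filter can be written with the exists higher-order function, which checks the structs directly instead of building a map first; this sketch assumes Spark 2.4+:

from pyspark.sql.functions import expr

# keep rows where some struct has index 2 and a value matching the regex 'b'
df1 = df.where(expr("exists(customDimensions, d -> d.index = 2 AND d.value rlike 'b')"))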
Upvotes: 2