Reputation: 5165
I have the DataFrame below:
dataDictionary = [('value1', [{'key': 'Fruit', 'value': 'Apple'}, {'key': 'Colour', 'value': 'White'}]),
('value2', [{'key': 'Fruit', 'value': 'Mango'}, {'key': 'Bird', 'value': 'Eagle'}, {'key': 'Colour', 'value': 'Black'}])]
df = spark.createDataFrame(data=dataDictionary)
df.printSchema()
df.show(truncate=False)
+------+------------------------------------------------------------------------------------------------+
|_1 |_2 |
+------+------------------------------------------------------------------------------------------------+
|value1|[{value -> Apple, key -> Fruit}, {value -> White, key -> Colour}] |
|value2|[{value -> Mango, key -> Fruit}, {value -> Eagle, key -> Bird}, {value -> Black, key -> Colour}]|
+------+------------------------------------------------------------------------------------------------+
I want to extract only the values where key -> Colour. The expected result is:
White
Black
I have tried multiple options using regexp_extract_all, as well as substring with instr, but the result is always empty. What can I try next?
result = spark.sql("""select
regexp_extract('_2', '''key': 'Colour' + '(\\w+)') as value
from table
""")
Upvotes: 0
Views: 40
Reputation: 4234
Prefer PySpark's built-in functions: they are efficient and convenient.
First, use the filter function to keep only the maps whose key is Colour in the _2 column (an array of maps), then take the first (index 0) element of the filtered array, and finally read the entry stored under the value key of that map.
from pyspark.sql import SparkSession, functions as F
...
# Keep only the entries whose 'key' is 'Colour', take the first match,
# and read its 'value' field (null if no entry matched)
df = df.select('_1', F.filter('_2', lambda x: x['key'] == 'Colour')[0]['value'])
Upvotes: 1