Reputation: 3
I would like to parse and get the value of a specific key from a PySpark SQL dataframe with the format below.
I was able to achieve this with a UDF, but it takes almost 20 minutes to process 40 columns with a JSON size of 100MB. I tried explode as well, but it gives separate rows for each array element, whereas I need only the value of the specific key in a given array of structs.
Format
array<struct<key:string,value:struct<int_value:string,string_value:string>>>
def getValueFunc(searcharray, searchkey):
    for val in searcharray:
        if val["key"] == searchkey:
            if val["value"]["string_value"] is not None:
                actual = val["value"]["string_value"]
                return actual
            elif val["value"]["int_value"] is not None:
                actual = val["value"]["int_value"]
                return str(actual)
            else:
                return "---"
.....
from pyspark.sql.functions import udf, col, lit
from pyspark.sql.types import StringType

getValue = udf(getValueFunc, StringType())
....
# register the UDF so it can also be used in SQL expressions
spark.udf.register("getValue", getValue)
.....
df.select(getValue(col("event_params"), lit("category")).alias("event_category"))
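Outside Spark, the lookup logic of the UDF can be exercised on plain Python data, since Spark passes each array-of-structs cell to the UDF as a list of row-like records. The sample records below are hypothetical stand-ins for the `event_params` column, not data from the original post:

```python
# Plain-Python sketch of the UDF's lookup logic; the sample records
# are illustrative stand-ins for array<struct<key, value>> cells.
def get_value(searcharray, searchkey):
    for val in searcharray:
        if val["key"] == searchkey:
            if val["value"]["string_value"] is not None:
                return val["value"]["string_value"]
            elif val["value"]["int_value"] is not None:
                return str(val["value"]["int_value"])
            else:
                return "---"

params = [
    {"key": "category", "value": {"int_value": None, "string_value": "news"}},
    {"key": "score",    "value": {"int_value": 42,   "string_value": None}},
]
print(get_value(params, "category"))  # news
print(get_value(params, "score"))     # 42
```

Note that the function scans the whole array per row per column, which is why calling it for 40 columns over 100MB of JSON is slow: each call pays Python serialization overhead on top of the linear scan.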
Upvotes: 0
Views: 1323
Reputation: 13998
For Spark 2.4.0+, you can use SparkSQL's filter() function to find the first array element which matches key == searchkey
and then retrieve its value. Below is a Spark SQL snippet template (searchkey as a variable) to do the first part mentioned above.
stmt = '''filter(event_params, x -> x.key == "{}")[0]'''.format(searchkey)
Run the above stmt with the expr() function, assign the resulting value (a StructType) to a temporary column f1, and then use the coalesce() function to retrieve the first non-null value.
from pyspark.sql.functions import expr
df.withColumn('f1', expr(stmt)) \
.selectExpr("coalesce(f1.value.string_value, string(f1.value.int_value),'---') AS event_category") \
.show()
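To make the semantics of `filter(...)[0]` plus `coalesce()` concrete without a Spark session, here is a plain-Python rendering; the helper names `first_match` and `coalesce` and the sample records are illustrative, not part of the Spark API:

```python
# Plain-Python rendering of the SQL expression's semantics; names are illustrative.
def first_match(event_params, searchkey):
    """Like filter(event_params, x -> x.key == searchkey)[0]: first match or None."""
    matches = [x for x in event_params if x["key"] == searchkey]
    return matches[0] if matches else None

def coalesce(*args):
    """Like SQL coalesce(): returns the first non-null argument."""
    return next((a for a in args if a is not None), None)

params = [
    {"key": "category", "value": {"int_value": None, "string_value": "news"}},
    {"key": "score",    "value": {"int_value": 42,   "string_value": None}},
]

f1 = first_match(params, "score")
# string(f1.value.int_value) in SQL yields null for null input, so guard the cast:
int_as_str = str(f1["value"]["int_value"]) if f1["value"]["int_value"] is not None else None
event_category = coalesce(f1["value"]["string_value"], int_as_str, "---")
print(event_category)  # 42
```

Because filter() runs entirely inside Spark SQL's codegen, this avoids the per-row Python UDF serialization cost that made the original approach slow.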
Let me know if you have any problem running the above code.
Upvotes: 1