Reputation: 3
I would like to parse and get the value of a specific key from a PySpark SQL dataframe with the format below.
I was able to achieve this with a UDF, but it takes almost 20 minutes to process 40 columns with a JSON size of 100MB. I tried explode as well, but it gives separate rows for each array element, whereas I need only the value of the specific key in a given array of structs.
Format
array<struct<key:string,value:struct<int_value:string,string_value:string>>>
def getValueFunc(searcharray, searchkey):
    for val in searcharray:
        if val["key"] == searchkey:
            if val["value"]["string_value"] is not None:
                actual = val["value"]["string_value"]
                return actual
            elif val["value"]["int_value"] is not None:
                actual = val["value"]["int_value"]
                return str(actual)
            else:
                return "---"
.....
from pyspark.sql.functions import udf, col, lit
from pyspark.sql.types import StringType

getValue = udf(getValueFunc, StringType())
....
# register the UDF so it can also be used in SQL expressions
spark.udf.register("getValue", getValue)
.....
df.select(getValue(col("event_params"), lit("category")).alias("event_category"))
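Outside Spark, the lookup logic of the UDF can be exercised on plain Python data, since Spark passes each array-of-structs cell to the UDF as a list of row-like records. The sample records below are hypothetical stand-ins for the `event_params` column, not data from the original post:

```python
# Plain-Python sketch of the UDF's lookup logic; the sample records
# are illustrative stand-ins for array<struct<key, value>> cells.
def get_value(searcharray, searchkey):
    for val in searcharray:
        if val["key"] == searchkey:
            if val["value"]["string_value"] is not None:
                return val["value"]["string_value"]
            elif val["value"]["int_value"] is not None:
                return str(val["value"]["int_value"])
            else:
                return "---"

params = [
    {"key": "category", "value": {"int_value": None, "string_value": "news"}},
    {"key": "score",    "value": {"int_value": 42,   "string_value": None}},
]
print(get_value(params, "category"))  # news
print(get_value(params, "score"))     # 42
```

Note that the function scans the whole array per row per column, which is why calling it for 40 columns over 100MB of JSON is slow: each call pays Python serialization overhead on top of the linear scan.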
Upvotes: 0
Views: 1323
Reputation: 13998
For Spark 2.4.0+, you can use SparkSQL's filter() function to find the first array element which matches key == searchkey
and then retrieve its value. Below is a Spark SQL snippet template (searchkey as a variable) to do the first part mentioned above.
stmt = '''filter(event_params, x -> x.key == "{}")[0]'''.format(searchkey)
Run the above stmt with the expr() function, assign the resulting value (a StructType) to a temporary column f1, and then use the coalesce() function to retrieve the first non-null value.
from pyspark.sql.functions import expr
df.withColumn('f1', expr(stmt)) \
.selectExpr("coalesce(f1.value.string_value, string(f1.value.int_value),'---') AS event_category") \
.show()
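To make the semantics of `filter(...)[0]` plus `coalesce()` concrete without a Spark session, here is a plain-Python rendering; the helper names `first_match` and `coalesce` and the sample records are illustrative, not part of the Spark API:

```python
# Plain-Python rendering of the SQL expression's semantics; names are illustrative.
def first_match(event_params, searchkey):
    """Like filter(event_params, x -> x.key == searchkey)[0]: first match or None."""
    matches = [x for x in event_params if x["key"] == searchkey]
    return matches[0] if matches else None

def coalesce(*args):
    """Like SQL coalesce(): returns the first non-null argument."""
    return next((a for a in args if a is not None), None)

params = [
    {"key": "category", "value": {"int_value": None, "string_value": "news"}},
    {"key": "score",    "value": {"int_value": 42,   "string_value": None}},
]

f1 = first_match(params, "score")
# string(f1.value.int_value) in SQL yields null for null input, so guard the cast:
int_as_str = str(f1["value"]["int_value"]) if f1["value"]["int_value"] is not None else None
event_category = coalesce(f1["value"]["string_value"], int_as_str, "---")
print(event_category)  # 42
```

Because filter() runs entirely inside Spark SQL's codegen, this avoids the per-row Python UDF serialization cost that made the original approach slow.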
Let me know if you have any problem running the above code.
Upvotes: 1