SerialBandicoot

Reputation: 35

PySpark Transformation on column with Json

I have a column containing a JSON array, and I'm trying to create a new column with only a subset of the JSON, plus some potential transforms on the JSON data. I'm using the following Databricks page as a reference.

https://docs.azuredatabricks.net/_static/notebooks/transform-complex-data-types-python.html

Given:

ID  js1
1   {"a":1, "b":1}

And I want to return:

ID  js1               js2
1   [{"a":1, "b":1}]  [{"a":1}]

I'm using a slightly cut-down version of the pseudo-method below for brevity.

def my_method(js):
    reader = spark.read
    reader.schema(schema)  # schema provided elsewhere
    json = reader.json([js])  # <-- Error here

    return lit(str(json["a"]))

df.withColumn("js2", my_method(col("js1")))

The error I'm getting is Column is not iterable. So how can I transform the contents of the JSON column and return a transformed block of JSON using withColumn?

Upvotes: 1

Views: 92

Answers (1)

Alex Ott

Reputation: 87174

Instead of accessing fields using [name], you need to use the map_filter function, like this (adjust the list of possible keys):

from pyspark.sql.functions import map_filter

df.select(
    map_filter("js1", lambda k, _: (k == 'a') | (k == 'c')).alias("js2")
)

P.S. You can't use spark.read from inside a user-defined function.
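
For reference, here is a minimal end-to-end sketch of the full flow (assumptions on my part: Spark 3.1+ for the Python map_filter API, and js1 holding a single JSON object string rather than an array): parse js1 into a map with from_json, filter the keys, then serialize back with to_json.

# Hypothetical end-to-end example; column names and types are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, map_filter, to_json
from pyspark.sql.types import IntegerType, MapType, StringType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, '{"a":1, "b":1}')], ["ID", "js1"])

result = (
    df
    # parse the JSON string into a map<string,int> so map_filter can operate on it
    .withColumn("m", from_json(col("js1"), MapType(StringType(), IntegerType())))
    # keep only the wanted keys, then serialize the map back to a JSON string
    .withColumn("js2", to_json(map_filter(col("m"), lambda k, _: (k == 'a') | (k == 'c'))))
    .drop("m")
)

result.show(truncate=False)
# +---+--------------+-------+
# |ID |js1           |js2    |
# +---+--------------+-------+
# |1  |{"a":1, "b":1}|{"a":1}|
# +---+--------------+-------+

map_filter only operates on MapType columns, which is why the JSON string has to be parsed with from_json first.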

Upvotes: 1
