Venkata

Reputation: 325

Parsing Nested JSON column in Spark

I have the below sample data as a dataframe:

id|data
1,{"a":["xyz","x"],"b":["abc","ab"]}
2,{"a":["zy","y"],"b":["ac","bc"]}

This is my expected output:

id|details
1,abc
2,ac

This is what I have come up with till now:

dataframe.withColumn("details",from_json($"data",StructType(Seq(StructField("b",ArrayType(StringType),true)))))

This is the output I am getting right now:

1,[abc,ab]
2,[ac,bc]

Also, this is just a sample; I have more than 100 columns, so I cannot use a select clause. I want to resolve this using withColumn.

Upvotes: 0

Views: 575

Answers (1)

Grisha Weintraub

Reputation: 7986

Assuming you want the first element of the b array, you can simply select it from the details column you have already computed:

 dataframe.withColumn("details", expr("details.b[0]"))

You can also use the getField and getItem functions directly on the result of from_json:

dataframe.withColumn("details",from_json($"data",StructType(Seq(StructField("b",ArrayType(StringType),true)))).getField("b").getItem(0))

Upvotes: 2
