user2538559
user2538559

Reputation: 59

Filter based on JSON data which is in a string column in a Spark dataframe

I have a Spark dataframe in the below format, where FamilyDetails column is a string field:

root
 |-- FirstName: string (nullable = true)
 |-- LastName: string (nullable = true)
 |-- FamilyDetails: string (nullable = true)

+----------+---------+--------------------------------------------------------------------------------------------------------------------------+
|FirstName |LastName |FamilyDetails                                                                                                             |
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+
|Emma      |Smith    |{                                                                                                                         |
|          |         | "23214598.31601190":{"gender":"F","Name":"Ms Olivia Smith","relationship":"Daughter"},                                   |
|          |         | "23214598.23214598":{"gender":"F","Name":"Ms Emma Smith","relationship":null}                                            |
|          |         |}                                                                                                                         |
|Joe       |Williams |{                                                                                                                         |
|          |         |  "2321463.2321463":{"gender":"M","Name":"Mr Joe Williams","relationship":null},                                          |
|          |         |  "2321463.3841483":{"gender":"F","Name":"Mrs Sophia Williams","relationship":"Wife","IsActive":"N"}                      |
|          |         |}                                                                                                                         |
|Liam      |Jones    |{                                                                                                                         |
|          |         |  "2321464.12379942":{"gender":"F","Name":"Miss Patricia Jones","relationship":"Sister"},                                 |
|          |         |  "2321464.2321464":{"gender":"M","Name":"Mr Liam Jones","relationship":null,"IsActive":"Y"}                              |
|          |         |}                                                                                                                         |
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+

What I am trying to do:

I am trying to get records where we have inactive family members (IsActive='N'). Point to note that IsActive is an optional field.

Expected output:

+----------+---------+--------------------------------------------------------------------------------------------------------------------------+
|FirstName |LastName |FamilyDetails                                                                                                             |
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+                                                                                                                   |
|Joe       |Williams |{                                                                                                                         |
|          |         |  "2321463.2321463":{"gender":"M","Name":"Mr Joe Williams","relationship":null},                                          |
|          |         |  "2321463.3841483":{"gender":"F","Name":"Mrs Sophia Williams","relationship":"Wife","IsActive":"N"}                      |
|          |         |}                                                                                                                         |
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+

What I have tried so far:

Since the complete schema is unknown, I tried to create the schema from FamilyDetails column itself.

import org.apache.spark.sql.functions._
import spark.implicits._
val json_schema = spark.read.json(myDF.select("FamilyDetails").as[String]).schema
println(json_schema)

which gives me:

StructType(
    StructField(23214598.31601190,
        StructType(
            StructField(gender,StringType,true), 
            StructField(Name,StringType,true), 
            StructField(relationship,StringType,true), 
            StructField(IsActive,StringType,true)
        )
        ,true
    )
)

How do I get rid of the first value (2321463.2321463) and take only the required fields in the json schema? Or is there any easier approach to filter records where IsActive = 'N'?

Upvotes: 0

Views: 1137

Answers (1)

mck
mck

Reputation: 42392

Perhaps you can avoid parsing the JSON by simply finding the string "IsActive":"N":

val df2 = df.filter("""FamilyDetails rlike '"IsActive":"N"'""")

For more rigorous parsing, you can use:

val df2 = df.filter("exists(map_values(from_json(FamilyDetails, 'map<string,map<string,string>>')), x -> x['IsActive'] = 'N')")

Upvotes: 1

Related Questions