Reputation: 59
I have a Spark dataframe in the below format, where FamilyDetails
column is a string field:
root
|-- FirstName: string (nullable = true)
|-- LastName: string (nullable = true)
|-- FamilyDetails: string (nullable = true)
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+
|FirstName |LastName |FamilyDetails |
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+
|Emma |Smith |{ |
| | | "23214598.31601190":{"gender":"F","Name":"Ms Olivia Smith","relationship":"Daughter"}, |
| | | "23214598.23214598":{"gender":"F","Name":"Ms Emma Smith","relationship":null} |
| | |} |
|Joe |Williams |{ |
| | | "2321463.2321463":{"gender":"M","Name":"Mr Joe Williams","relationship":null}, |
| | | "2321463.3841483":{"gender":"F","Name":"Mrs Sophia Williams","relationship":"Wife","IsActive":"N"} |
| | |} |
|Liam |Jones |{ |
| | | "2321464.12379942":{"gender":"F","Name":"Miss Patricia Jones","relationship":"Sister"}, |
| | | "2321464.2321464":{"gender":"M","Name":"Mr Liam Jones","relationship":null,"IsActive":"Y"} |
| | |} |
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+
What I am trying to do:
I am trying to get records where we have inactive family members (IsActive='N'
). Point to note that IsActive
is an optional field.
Expected output:
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+
|FirstName |LastName |FamilyDetails |
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+ |
|Joe |Williams |{ |
| | | "2321463.2321463":{"gender":"M","Name":"Mr Joe Williams","relationship":null}, |
| | | "2321463.3841483":{"gender":"F","Name":"Mrs Sophia Williams","relationship":"Wife","IsActive":"N"} |
| | |} |
+----------+---------+--------------------------------------------------------------------------------------------------------------------------+
What I have tried so far:
Since the complete schema is unknown, I tried to create the schema from FamilyDetails
column itself.
import org.apache.spark.sql.functions._
import spark.implicits._
val json_schema = spark.read.json(myDF.select("FamilyDetails").as[String]).schema
println(json_schema)
which gives me:
StructType(
StructField(23214598.31601190,
StructType(
StructField(gender,StringType,true),
StructField(Name,StringType,true),
StructField(relationship,StringType,true),
StructField(IsActive,StringType,true)
)
,true
)
)
How do I get rid of the first value (2321463.2321463) and take only the required fields in the json schema? Or is there any easier approach to filter records where IsActive = 'N'
?
Upvotes: 0
Views: 1137
Reputation: 42392
Perhaps you can avoid parsing the JSON by simply finding the string "IsActive":"N"
:
val df2 = df.filter("""FamilyDetails rlike '"IsActive":"N"'""")
For more rigorous parsing, you can use:
val df2 = df.filter("exists(map_values(from_json(FamilyDetails, 'map<string,map<string,string>>')), x -> x['IsActive'] = 'N')")
Upvotes: 1