Reputation: 1233
Let's say that I have a dataframe with a column data.
In this column I have a string with JSON inside.
The trick is that the JSON is not always complete: some attributes may be missing in some rows.
See the sample below for clarity:
column_name_placeholder | data
foo                     | {"attr1":1}
foo                     | {"attr2":2}
bar                     | {"attr0":"str"}
bar                     | {"attr3":"po"}
What I'm looking for is a way to infer the full JSON schema for each key in "column_name_placeholder", so the answer would be something like this:
foo
{
  "attr1": int,
  "attr2": int
}
bar
{
  "attr0": string,
  "attr3": string
}
The only way I can imagine doing this is to go down to the RDD level, infer a schema with some kind of 3rd-party library on the map stage, and merge those schemas with some 3rd-party library on the reduce stage.
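A rough sketch of that map/reduce idea, assuming the 3rd-party genson library (pip install genson) for inferring and merging JSON schemas; the library choice is an assumption on my part, any JSON-schema inference library would do:

import json

from genson import SchemaBuilder  # assumed 3rd-party library: pip install genson

def infer_schema(json_str):
    # Map stage: infer a JSON schema (a plain dict) from a single row's string
    builder = SchemaBuilder()
    builder.add_object(json.loads(json_str))
    return builder.to_schema()

def merge_schemas(left, right):
    # Reduce stage: merge two inferred schemas into one
    builder = SchemaBuilder()
    builder.add_schema(left)
    builder.add_schema(right)
    return builder.to_schema()

schemas = (
    df.rdd
    .map(lambda row: (row.column_name_placeholder, infer_schema(row.data)))
    .reduceByKey(merge_schemas)
    .collectAsMap()  # {"foo": {...}, "bar": {...}}
)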
Am I missing some Spark magic?
Upvotes: 0
Views: 1110
Reputation: 32690
You can transform the column into an RDD of JSON strings and read it back using spark.read.json, letting Spark infer the schema.
Example for column_name_placeholder = 'bar':
spark.read.json(
    df.filter("column_name_placeholder = 'bar'").rdd.map(lambda row: row.data)
).printSchema()

# root
#  |-- attr0: string (nullable = true)
#  |-- attr3: string (nullable = true)
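To get a schema for every key rather than just 'bar', you could loop over the distinct values of the column, following the same pattern (a sketch; names follow the question):

# Print the inferred JSON schema for each distinct key in the column
for (key,) in df.select("column_name_placeholder").distinct().collect():
    print(key)
    spark.read.json(
        df.filter(df.column_name_placeholder == key).rdd.map(lambda row: row.data)
    ).printSchema()

Note that spark.read.json triggers a separate job per key, so this is fine for a handful of keys but gets expensive if there are many.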
Upvotes: 2