tty6

Reputation: 1233

A way to infer JSON data schema in Spark

Let's say that I have a dataframe with a column data. In this column I have a string with JSON inside. The trick is that the JSON is not always complete; some attributes may be missing in some rows.

See the sample below for clarity:

column_name_placeholder | data
foo                     | {"attr1":1}
foo                     | {"attr2":2}
bar                     | {"attr0":"str"}
bar                     | {"attr3":"po"}

What I'm looking for is a way to infer the full JSON schema for each key in "column_name_placeholder".

So the answer would be something like this:

foo
{
"attr1":int,
"attr2":int
}
bar
{
"attr0":string,
"attr3":string
}

The only way I can imagine doing that is to go down to the RDD level, infer a schema with some kind of 3rd-party library at the map stage, and merge those schemas with some 3rd-party library at the reduce stage, roughly as sketched below.
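
Something like this is what I have in mind (a rough sketch using plain Python types instead of a real schema library; column names are taken from the sample above):

import json

def to_schema(json_str):
    # map each attribute to the name of its Python type, e.g. {"attr1": "int"}
    return {k: type(v).__name__ for k, v in json.loads(json_str).items()}

def merge(a, b):
    # union of the attribute -> type dicts seen so far
    a.update(b)
    return a

schemas = (
    df.rdd
      .map(lambda row: (row.column_name_placeholder, to_schema(row.data)))
      .reduceByKey(merge)
      .collectAsMap()
)
# e.g. {'foo': {'attr1': 'int', 'attr2': 'int'}, 'bar': {'attr0': 'str', 'attr3': 'str'}}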

Am I missing some Spark magic?

Upvotes: 0

Views: 1110

Answers (1)

blackbishop

Reputation: 32690

You can convert the JSON column to an RDD of strings and read it back with spark.read.json, letting Spark infer the schema.

Example for column_name_placeholder = bar:

spark.read.json(
    df.filter("column_name_placeholder = 'bar'").rdd.map(lambda row: row.data)
).printSchema()

#root
# |-- attr0: string (nullable = true)
# |-- attr3: string (nullable = true)
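
If you need one schema per value of column_name_placeholder, you can repeat that for each distinct value. A rough sketch, assuming the set of distinct values is small enough to collect to the driver:

keys = [
    r.column_name_placeholder
    for r in df.select("column_name_placeholder").distinct().collect()
]

for key in keys:
    # infer and print the schema of the JSON strings belonging to this key
    print(key)
    spark.read.json(
        df.filter(df.column_name_placeholder == key).rdd.map(lambda row: row.data)
    ).printSchema()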

Upvotes: 2
