tty6

Reputation: 1233

A way to infer JSON data schema in Spark

Let's say that I have a dataframe with a column data. In this column I have a string with JSON inside. The trick is that the JSON is not always complete; some attributes may be missing in some rows.

See the sample below for clarity:

column_name_placeholder | data
foo                     | {"attr1":1}
foo                     | {"attr2":2}
bar                     | {"attr0":"str"}
bar                     | {"attr3":"po"}

What I'm looking for is a way to infer the full JSON schema for each key in "column_name_placeholder".

So the answer would be something like this:

foo
{
"attr1":int,
"attr2":int
}
bar
{
"attr0":string,
"attr3":string
}

The only way I can imagine doing that is to go down to the RDD level, infer a schema with some kind of 3rd-party library at the map stage, and merge those schemas with some 3rd-party library at the reduce stage, roughly as sketched below.
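
Something like this is what I have in mind (a rough sketch using plain Python types instead of a real schema library; column names are taken from the sample above):

import json

def to_schema(json_str):
    # map each attribute to the name of its Python type, e.g. {"attr1": "int"}
    return {k: type(v).__name__ for k, v in json.loads(json_str).items()}

def merge(a, b):
    # union of the attribute -> type dicts seen so far
    a.update(b)
    return a

schemas = (
    df.rdd
      .map(lambda row: (row.column_name_placeholder, to_schema(row.data)))
      .reduceByKey(merge)
      .collectAsMap()
)
# e.g. {'foo': {'attr1': 'int', 'attr2': 'int'}, 'bar': {'attr0': 'str', 'attr3': 'str'}}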

Am I missing some Spark magic?

Upvotes: 0

Views: 1110

Answers (1)

blackbishop

Reputation: 32690

You can convert the JSON column to an RDD of strings and read it back with spark.read.json, letting Spark infer the schema.

Example for column_name_placeholder = bar:

spark.read.json(
    df.filter("column_name_placeholder = 'bar'").rdd.map(lambda row: row.data)
).printSchema()

#root
# |-- attr0: string (nullable = true)
# |-- attr3: string (nullable = true)
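
If you need one schema per value of column_name_placeholder, you can repeat that for each distinct value. A rough sketch, assuming the set of distinct values is small enough to collect to the driver:

keys = [
    r.column_name_placeholder
    for r in df.select("column_name_placeholder").distinct().collect()
]

for key in keys:
    # infer and print the schema of the JSON strings belonging to this key
    print(key)
    spark.read.json(
        df.filter(df.column_name_placeholder == key).rdd.map(lambda row: row.data)
    ).printSchema()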

Upvotes: 2
