Harshit

Reputation: 125

How to infer schema of serialized JSON column in Spark SQL?

I have a table with one column that contains serialized JSON. I want to apply schema inference to this JSON column, but I don't know the schema to pass as input for JSON extraction (e.g. the from_json function).

I can do this in Scala like this:

import org.apache.spark.sql.functions.{col, from_json}
import spark.implicits._
val contextSchema = spark.read.json(data.select("context").as[String]).schema
val updatedData = data.withColumn("context", from_json(col("context"), contextSchema))

How can I translate this solution into pure Spark SQL?

Upvotes: 6

Views: 8615

Answers (3)

Vengleab SO

Reputation: 814

I found a workaround:

  1. Convert the column to an RDD of JSON strings and read it back through the Spark DataFrame reader, which infers the schema:

# reading JSON strings back through the reader triggers schema inference
inferred = (
    spark.read
    .option("inferSchema", True)
    .json(df.rdd.map(lambda rec: rec.context))
)

  2. If the field/path is known beforehand, we can use json_tuple, as shown below:

from pyspark.sql.functions import col, json_tuple

df.select(json_tuple(col("context"), "<json path or attribute>").alias("field_name")).show()
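
Since the question asks for pure Spark SQL, the second approach can also be written with LATERAL VIEW. A minimal sketch, assuming a registered view tmp whose string column context holds JSON like {"id":1,"name":"a"} (the table, alias, and field names are illustrative):

-- `tmp`, `t`, `id`, `name` are assumed names for illustration
SELECT t.id, t.name
FROM tmp
LATERAL VIEW json_tuple(context, 'id', 'name') t AS id, name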

Upvotes: 0

Mohana B C

Reputation: 5487

You can use the schema_of_json() function to infer the JSON schema:

select from_json(<column_name>, schema_of_json(<sample_JSON>)) from <table> 
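
Note that schema_of_json() only accepts a string literal (a foldable expression), so you pass a representative sample of the JSON rather than the column itself. A minimal sketch, assuming a table tmp with a JSON string column context shaped like {"id":1,"name":"a"} (names and sample are illustrative):

-- `tmp` and `context` are assumed names; the sample literal must match the column's JSON shape
SELECT from_json(context, schema_of_json('{"id":1,"name":"a"}')) AS parsed
FROM tmp

The fields of the resulting struct can then be referenced as parsed.id and parsed.name.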

Upvotes: 1

notNull

Reputation: 31510

For spark-sql, use toDDL to generate the schema as a DDL string, then use that schema in from_json.

Example:

df.show(10,false)
//+---+-------------------+
//|seq|json               |
//+---+-------------------+
//|1  |{"id":1,"name":"a"}|
//+---+-------------------+

import spark.implicits._
val sch = spark.read.json(df.select("json").as[String]).schema.toDDL
//sch: String = `id` BIGINT,`name` STRING

df.createOrReplaceTempView("tmp")

spark.sql(s"""select seq, jsn.* from (select *, from_json(json, "$sch") as jsn from tmp)""").
  show(10, false)
//+---+---+----+
//|seq|id |name|
//+---+---+----+
//|1  |1  |a   |
//+---+---+----+

Upvotes: 2
