Reputation: 125
I have a table with one column that contains serialized JSON. I want to apply schema inference to this JSON column, but I don't know the schema to pass as input for JSON extraction (e.g. the from_json function).
I can do this in Scala like:

import org.apache.spark.sql.functions.{col, from_json}
import spark.implicits._ // for .as[String]

val contextSchema = spark.read.json(data.select("context").as[String]).schema
val updatedData = data.withColumn("context", from_json(col("context"), contextSchema))
How can I translate this solution into pure Spark SQL?
Upvotes: 6
Views: 8615
Reputation: 814
I found a workaround:
- converting the JSON column to an RDD of strings and reading it with the DataFrame JSON reader (a complete sketch follows after this list):

  spark.read.option("inferSchema", True).json(
      df.rdd.map(lambda rec: rec.context)
  )
- if the field/path is known beforehand, we can use json_tuple:

  df.select(json_tuple(col("context"), "<json path or attribute>").alias("field_name")).show()
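For completeness, here is a minimal runnable sketch of the round trip, assuming hypothetical column names seq and context (adjust to your table); the schema inferred from the RDD read is applied back to the original DataFrame with from_json:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, json_tuple

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: "context" holds serialized JSON.
df = spark.createDataFrame(
    [(1, '{"id": 1, "name": "a"}')],
    ["seq", "context"],
)

# Infer the schema by round-tripping the JSON strings through an RDD.
inferred_schema = spark.read.json(df.rdd.map(lambda rec: rec.context)).schema

# Apply the inferred schema back to the original column.
parsed = df.withColumn("context", from_json(col("context"), inferred_schema))
parsed.printSchema()

# With known field names, json_tuple extracts them directly (as strings).
df.select("seq", json_tuple(col("context"), "id", "name")).show()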
Upvotes: 0
Reputation: 5487
You can use the schema_of_json() function to infer the JSON schema:

select from_json(<column_name>, schema_of_json(<sample_JSON>)) from <table>
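Note that in Spark SQL, schema_of_json() only accepts a foldable string, so <sample_JSON> has to be a literal representative row rather than the column itself, and the inferred schema only covers fields present in that sample. A minimal sketch run through PySpark, assuming a hypothetical table events with a serialized-JSON column context:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical table "events" with a serialized-JSON column "context".
spark.sql("""
    SELECT from_json(
               context,
               schema_of_json('{"id": 1, "name": "a"}')
           ) AS context
    FROM events
""").printSchema()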
Upvotes: 1
Reputation: 31510
For spark-sql, use toDDL to generate the schema string, then use that schema in from_json.
Example:
df.show(10,false)
//+---+-------------------+
//|seq|json |
//+---+-------------------+
//|1 |{"id":1,"name":"a"}|
//+---+-------------------+
import spark.implicits._ // required for .as[String]
val sch = spark.read.json(df.select("json").as[String]).schema.toDDL
//sch: String = `id` BIGINT,`name` STRING
df.createOrReplaceTempView("tmp")
spark.sql(s"""select seq,jsn.* from (select *,from_json(json,"$sch") as jsn from tmp)""").
show(10,false)
//+---+---+----+
//|seq|id |name|
//+---+---+----+
//|1 |1 |a |
//+---+---+----+
Upvotes: 2