Reputation: 855
I have written a sample Spark app in which I create a dataframe with a MapType column and write it to disk. Then I read the same file back and print its schema. But the output file's schema differs from the input schema, and I don't see the MapType in the output. How can I read that output file with the MapType?
Code
import org.apache.spark.sql.{SaveMode, SparkSession}

case class Department(Id: String, Description: String)
case class Person(name: String, department: Map[String, Department])

object sample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local").appName("Custom Poc").getOrCreate
    import spark.implicits._

    val schemaData = Seq(
      Person("Persion1", Map("It" -> Department("1", "It Department"), "HR" -> Department("2", "HR Department"))),
      Person("Persion2", Map("It" -> Department("1", "It Department")))
    )
    val df = spark.sparkContext.parallelize(schemaData).toDF()

    println("Input schema")
    df.printSchema()
    df.write.mode(SaveMode.Overwrite).json("D:\\save\\output")

    println("Output schema")
    spark.read.json("D:\\save\\output\\*.json").printSchema()
  }
}
Output
Input schema
root
|-- name: string (nullable = true)
|-- department: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- Id: string (nullable = true)
| | |-- Description: string (nullable = true)
Output schema
root
|-- department: struct (nullable = true)
| |-- HR: struct (nullable = true)
| | |-- Description: string (nullable = true)
| | |-- Id: string (nullable = true)
| |-- It: struct (nullable = true)
| | |-- Description: string (nullable = true)
| | |-- Id: string (nullable = true)
|-- name: string (nullable = true)
JSON file
{"name":"Persion1","department":{"It":{"Id":"1","Description":"It Department"},"HR":{"Id":"2","Description":"HR Department"}}}
{"name":"Persion2","department":{"It":{"Id":"1","Description":"It Department"}}}
EDIT: I added the file-saving part above only to illustrate my requirement. In the actual scenario I will just be reading the JSON data shown above and working with that dataframe.
Upvotes: 1
Views: 4552
Reputation: 23109
You can pass the schema from the previous dataframe while reading the JSON data:
println("Input schema")
df.printSchema()
df.write.mode(SaveMode.Overwrite).json("D:\\save\\output")
println("Output schema")
spark.read.schema(df.schema).json("D:\\save\\output").printSchema()
Input schema
root
|-- name: string (nullable = true)
|-- department: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- Id: string (nullable = true)
| | |-- Description: string (nullable = true)
Output schema
root
|-- name: string (nullable = true)
|-- department: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- Id: string (nullable = true)
| | |-- Description: string (nullable = true)
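If the original dataframe is not available (as in the edited question, where only the JSON files exist), you can build the same schema by hand with `StructType` and `MapType` and pass it to the reader. This is a minimal sketch; the object name `ReadWithExplicitSchema` is made up, and the path is the one from the question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object ReadWithExplicitSchema {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local").appName("Explicit Schema").getOrCreate

    // Struct matching Department(Id: String, Description: String)
    val departmentType = StructType(Seq(
      StructField("Id", StringType, nullable = true),
      StructField("Description", StringType, nullable = true)
    ))

    // Schema matching Person(name: String, department: Map[String, Department])
    val personSchema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("department", MapType(StringType, departmentType), nullable = true)
    ))

    // With an explicit schema, Spark keeps the map instead of
    // inferring a struct with one field per map key.
    spark.read.schema(personSchema).json("D:\\save\\output").printSchema()
  }
}
```

Alternatively, if the `Person` and `Department` case classes are on the classpath, `org.apache.spark.sql.Encoders.product[Person].schema` derives the same `StructType` without writing it out by hand.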
Hope this helps!
Upvotes: 3