Reputation: 1744
scala> val df = spark.read.json("data.json")
scala> df.printSchema
root
|-- a: struct (nullable = true)
| |-- b: struct (nullable = true)
| | |-- c: long (nullable = true)
|-- **TimeStamp: string (nullable = true)**
|-- id: string (nullable = true)
scala> val df1 = df.withColumn("TimeStamp", $"TimeStamp".cast(TimestampType))
scala> df1.printSchema
root
|-- a: struct (nullable = true)
| |-- b: struct (nullable = true)
| | |-- c: long (nullable = true)
|-- **TimeStamp: timestamp (nullable = true)** // WORKING AS EXPECTED
|-- id: string (nullable = true)
scala> val df2 = df.withColumn("a.b.c", $"a.b.c".cast(DoubleType))
scala> df2.printSchema
root
|-- a: struct (nullable = true)
| |-- b: struct (nullable = true)
| | |-- c: long (nullable = true)
|-- TimeStamp: string (nullable = true)
|-- id: string (nullable = true)
|-- **a.b.c: double (nullable = true)** // DUPLICATE COLUMN ADDED
I'm trying to change the type of a nested JSON attribute within a data frame column. the change in a nested attribute has been treated as a new column which resulting a duplicate column. the change is working fine for the top level attributes (Timestamp) but not for the nested ones (a.b.c). Any thoughts on this problem ?.
Upvotes: 2
Views: 515
Reputation: 1323
since your column is of struct type & you need to build it again in the same hierarchy. Because it doesn't go by assumption, it thinks that you are rewriting the structure. Input:
{"a": {"b": {"c": "1.31", "d": "1.11"}}, "TimeStamp": "2017-02-18", "id":1}
{"a": {"b": {"c": "2.31", "d": "2.22"}}, "TimeStamp": "2017-02-18", "id":1}
val lines2 = spark.read.json("/home/kiran/km/km_hadoop/data/data_nested_struct_col2.json")
lines2.printSchema()
val df2 = lines2.withColumn("a", struct(
struct(
lines2("a.b.c").cast(DoubleType).as("c"),
lines2("a.b.d").as("d")
).as("b")))
.withColumn("TimeStamp", lines2("TimeStamp").cast(DateType))
df2.printSchema()
This is output of both schemas before & after:
root
|-- TimeStamp: string (nullable = true)
|-- a: struct (nullable = true)
| |-- b: struct (nullable = true)
| | |-- c: string (nullable = true)
| | |-- d: string (nullable = true)
|-- id: long (nullable = true)
root
|-- TimeStamp: date (nullable = true)
|-- a: struct (nullable = false)
| |-- b: struct (nullable = false)
| | |-- c: double (nullable = true)
| | |-- d: string (nullable = true)
|-- id: long (nullable = true)
I hope it is clear.
Upvotes: 1