Reputation: 27
I have flattened a nested JSON file, and now I am facing an ambiguity issue when trying to get the actual column names using PySpark.
I have a DataFrame with the following schema:
Before flattening:
root
|-- x: string (nullable = true)
|-- y: string (nullable = true)
|-- foo: struct (nullable = true)
| |-- a: float (nullable = true)
| |-- b: float (nullable = true)
| |-- c: integer (nullable = true)
After Flattening:
root
|-- x: string (nullable = true)
|-- y: string (nullable = true)
|-- foo_a: float (nullable = true)
|-- foo_b: float (nullable = true)
|-- foo_c: integer (nullable = true)
Is it possible to keep only the actual (unprefixed) column names in the DataFrame, as shown below?
root
|-- x: string (nullable = true)
|-- y: string (nullable = true)
|-- a: float (nullable = true)
|-- b: float (nullable = true)
|-- c: integer (nullable = true)
Upvotes: 1
Views: 255
Reputation: 87279
Yes, just do the following instead of flattening (assuming your DataFrame is called df):
df.select("*", "foo.*").drop("foo")
or
df.select("x", "y", "foo.*")
The foo.* syntax pulls all fields out of the struct and puts them at the top level, under their own names (a, b, c) rather than prefixed ones.
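A minimal sketch of both routes, in case it helps. The names strip_struct_prefix and demo are hypothetical helpers made up for illustration, the sample row is invented, and the "foo_" prefix is taken from the schema in the question. The first function is plain Python for the case where the DataFrame is already flattened and you only want to rename; it refuses to rename if two columns would end up with the same name, which is the ambiguity you would otherwise hit later. The second function needs a working PySpark install and shows the select approach on a DataFrame with the question's schema.

```python
def strip_struct_prefix(columns, prefix):
    """Return the column names with `prefix` removed where present.

    Hypothetical helper: raises ValueError if the stripped names collide,
    since duplicate column names make later references ambiguous.
    """
    renamed = [c[len(prefix):] if c.startswith(prefix) else c for c in columns]
    duplicates = sorted({c for c in renamed if renamed.count(c) > 1})
    if duplicates:
        raise ValueError(f"stripping {prefix!r} would duplicate: {duplicates}")
    return renamed


def demo():
    """Run both approaches on a one-row DataFrame (requires PySpark)."""
    # Imported here so the helper above can be used without PySpark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = (
        SparkSession.builder.master("local[1]").appName("flatten-demo").getOrCreate()
    )

    # Same shape as the "before flattening" schema in the question.
    schema = "x string, y string, foo struct<a: float, b: float, c: int>"
    df = spark.createDataFrame([("x1", "y1", (1.0, 2.0, 3))], schema=schema)

    # Route 1: expand the struct directly, so no prefix is ever added.
    df.select("*", "foo.*").drop("foo").printSchema()

    # Route 2: the DataFrame was already flattened to foo_a, foo_b, foo_c;
    # rename every column in one pass with toDF.
    flat = df.select(
        "x",
        "y",
        col("foo.a").alias("foo_a"),
        col("foo.b").alias("foo_b"),
        col("foo.c").alias("foo_c"),
    )
    flat.toDF(*strip_struct_prefix(flat.columns, "foo_")).printSchema()

    spark.stop()
```

Call demo() in an environment where PySpark is available to print both schemas. Note that if the struct had a field with the same name as a top-level column (say foo.x), the select route would give you two columns named x, so you would need to alias one of them.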
Upvotes: 1