Ravisaga

Reputation: 27

Actual column name after flattening Nested JSON using PySpark

I have flattened a nested JSON file, and now I am facing an ambiguity issue when trying to get the actual column names using PySpark.

I have a DataFrame with the following schema:

Before flattening:

root
 |-- x: string (nullable = true)
 |-- y: string (nullable = true)
 |-- foo: struct (nullable = true)
 |    |-- a: float (nullable = true)
 |    |-- b: float (nullable = true)
 |    |-- c: integer (nullable = true)

After flattening:

root
 |-- x: string (nullable = true)
 |-- y: string (nullable = true)
 |-- foo_a: float (nullable = true)
 |-- foo_b: float (nullable = true)
 |-- foo_c: integer (nullable = true)

Is it possible to keep only the original field names as the column names of the DataFrame, as shown below?

root
 |-- x: string (nullable = true)
 |-- y: string (nullable = true)
 |-- a: float (nullable = true)
 |-- b: float (nullable = true)
 |-- c: integer (nullable = true)
 

Upvotes: 1

Views: 255

Answers (1)

Alex Ott

Reputation: 87279

Yes, just do the following instead of flattening:

df.select("*", "foo.*").drop("foo")

or

df.select("x", "y", "foo.*")

The foo.* syntax pulls all fields out of the struct and puts them at the top level of the DataFrame, keeping their original names.

Upvotes: 1
