belka

Reputation: 1530

How to add a nested column to a DataFrame

I have a dataframe df with the following schema:

root
 |-- city_name: string (nullable = true)
 |-- person: struct (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- name: string (nullable = true)

I want to add a nested column, say car_brand, to my person struct. How can I do that?

The expected final schema would look like this:

root
 |-- city_name: string (nullable = true)
 |-- person: struct (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- car_brand: string (nullable = true)

Upvotes: 3

Views: 7666

Answers (4)

Prabhatika Vij

Reputation: 347

The withField function is available starting with Spark 3.1. According to the documentation, "it can be used to add/replace a nested field in StructType by name".

In this case, it can be used as follows:

import org.apache.spark.sql.functions

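// lit(...) wraps a constant value; col(...) would instead look up an existing column with that name
df.withColumn("person", functions.col("person").withField("car_brand", functions.lit("some car brand here")))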

Upvotes: 1

user2310605

Reputation: 36

import pyspark.sql.functions as func

dF = dF.withColumn(
    "person",
    func.struct(
        "person.age",
        func.struct(
            "person.name",
            func.lit(None).alias("NestedCol_Name"),
        ).alias("name"),
    ),
)

Output schema:

root
 |-- city_name: string (nullable = true)
 |-- person: struct (nullable = false)
 |    |-- age: string (nullable = true)
 |    |-- name: struct (nullable = false)
 |    |    |-- name: string (nullable = true)
 |    |    |-- NestedCol_Name: null (nullable = true)

Upvotes: 1

Vijayant

Reputation: 732

Adding a new nested column within person:

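import org.apache.spark.sql.functions.{lit, struct}
import spark.implicits._ // enables the $"..." syntax; assumes a SparkSession named spark

// Keep every existing field of person and append a new nested struct.
val updated = df.withColumn(
  "person",
  struct(
    $"person.*",
    struct(
      lit("value_1").as("person_field_1"),
      lit("value_2").as("person_field_2")
    ).as("nested_column_within_person")
  )
)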

Final schema:

root
 |-- city_name: string (nullable = true)
 |-- person: struct (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- nested_column_within_person: struct (nullable = true)
 |    |    |-- person_field_1: string (nullable = true)
 |    |    |-- person_field_2: string (nullable = true)

Upvotes: 2

Shaido

Reputation: 28322

You can unpack the existing struct into a new one and include the new column at the same time. For example, adding "bmw" as the car_brand of every person in the dataframe can be done like this:

df.withColumn("person", struct($"person.*", lit("bmw").as("car_brand")))

Upvotes: 5
