architectonic
architectonic

Reputation: 3129

Add an empty column to Spark DataFrame

As mentioned in many other locations on the web, adding a new column to an existing DataFrame is not straightforward. Unfortunately it is important to have this functionality (even though it is inefficient in a distributed environment) especially when trying to concatenate two DataFrames using unionAll.

What is the most elegant workaround for adding a null column to a DataFrame to facilitate a unionAll?

My version goes like this:

from pyspark.sql.types import StringType
from pyspark.sql.functions import UserDefinedFunction
to_none = UserDefinedFunction(lambda x: None, StringType())
new_df = old_df.withColumn('new_column', to_none(df_old['any_col_from_old']))

Upvotes: 103

Views: 217330

Answers (6)

aax
aax

Reputation: 454

Since Pyspark 2.3 you can also use the function unionByName with the option allowMissingColumns=True for your original purpose of unioning two dataframes with partly different columns.

https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrame.unionByName.html

Upvotes: 1

Underoos
Underoos

Reputation: 5190

Why not just use this?

from pyspark.sql.functions import lit
df.withColumn("column_name", lit("")).show()

Upvotes: 0

G.G
G.G

Reputation: 765

df1.selectExpr("school","null as col1").show()

output:

+--------------------+----+
|              school|col1|
+--------------------+----+
|Shanghai Jiao Ton...|null|
|   Peking University|null|
|Shanghai Jiao Ton...|null|
|    Fudan University|null|
|    Fudan University|null|
| Tsinghua University|null|
|Shanghai Jiao Ton...|null|
| Tsinghua University|null|
| Tsinghua University|null|
|   Peking University|null|

or in pyspark 2.2+

df1.pandas_api().assign(new_column=None)

Upvotes: 1

ZygD
ZygD

Reputation: 24386

The option without import StringType

df = df.withColumn('foo', F.lit(None).cast('string'))

Full example:

from pyspark.sql import functions as F
df = spark.range(1, 3).toDF('c')

df = df.withColumn('foo', F.lit(None).cast('string'))

df.printSchema()
#     root
#      |-- c: long (nullable = false)
#      |-- foo: string (nullable = true)

df.show()
#     +---+----+
#     |  c| foo|
#     +---+----+
#     |  1|null|
#     |  2|null|
#     +---+----+

Upvotes: 7

zero323
zero323

Reputation: 330093

All you need here is importing StringType and using lit and cast:

from pyspark.sql.types import StringType
from pyspark.sql.functions import lit

new_df = old_df.withColumn('new_column', lit(None).cast(StringType()))

A full example:

df = sc.parallelize([row(1, "2"), row(2, "3")]).toDF()
df.printSchema()
# root
#  |-- foo: long (nullable = true)
#  |-- bar: string (nullable = true)

new_df = df.withColumn('new_column', lit(None).cast(StringType()))

new_df.printSchema()
# root
#  |-- foo: long (nullable = true)
#  |-- bar: string (nullable = true)
#  |-- new_column: string (nullable = true)

new_df.show()
# +---+---+----------+
# |foo|bar|new_column|
# +---+---+----------+
# |  1|  2|      null|
# |  2|  3|      null|
# +---+---+----------+

A Scala equivalent can be found here: Create new Dataframe with empty/null field values

Upvotes: 204

Shrikant Prabhu
Shrikant Prabhu

Reputation: 735

I would cast lit(None) to NullType instead of StringType. So that if we ever have to filter out not null rows on that column...it can be easily done as follows

df = sc.parallelize([Row(1, "2"), Row(2, "3")]).toDF()

new_df = df.withColumn('new_column', lit(None).cast(NullType()))

new_df.printSchema() 

df_null = new_df.filter(col("new_column").isNull()).show()
df_non_null = new_df.filter(col("new_column").isNotNull()).show()

Also be careful about not using lit("None")(with quotes) if you are casting to StringType since it would fail for searching for records with filter condition .isNull() on col("new_column").

Upvotes: 10

Related Questions