newbie

Reputation: 1412

How to concatenate to a null column in a PySpark dataframe

I have the dataframe below and I want to update its rows dynamically with some values:

input_frame.show()
+----------+----------+---------+
|student_id|name      |timestamp|
+----------+----------+---------+
|        s1|testuser  |       t1|
|        s1|sampleuser|       t2|
|        s2|test123   |       t1|
|        s2|sample123 |       t2|
+----------+----------+---------+

input_frame = input_frame.withColumn('test', sf.lit(None))
input_frame.show()
+----------+----------+---------+----+
|student_id|      name|timestamp|test|
+----------+----------+---------+----+
|        s1|  testuser|       t1|null|
|        s1|sampleuser|       t2|null|
|        s2|   test123|       t1|null|
|        s2| sample123|       t2|null|
+----------+----------+---------+----+

input_frame = input_frame.withColumn('test', sf.concat(sf.col('test'), sf.lit('test')))
input_frame.show()
+----------+----------+---------+----+
|student_id|      name|timestamp|test|
+----------+----------+---------+----+
|        s1|  testuser|       t1|null|
|        s1|sampleuser|       t2|null|
|        s2|   test123|       t1|null|
|        s2| sample123|       t2|null|
+----------+----------+---------+----+

I want to update the 'test' column with some values and then filter on partial matches against that column. But concatenating to a null column results in a null column again. How can I do this?

Upvotes: 11

Views: 27465

Answers (3)

Santhosh Urumese

Reputation: 171

You can use the coalesce function, which returns the first of its arguments that is not null, and pass a literal as the second argument; that literal is used whenever the column holds a null value.

from pyspark.sql.functions import coalesce, concat, lit

df = df.withColumn("concat", concat(coalesce(df.a, lit('')), coalesce(df.b, lit(''))))
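Applied to the question's frame, a minimal sketch (assuming the input_frame and sf alias from the question):

import pyspark.sql.functions as sf

# Treat the all-null 'test' column as '' so the concatenation survives
input_frame = input_frame.withColumn(
    'test',
    sf.concat(sf.coalesce(sf.col('test'), sf.lit('')), sf.lit('test'))
)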

Upvotes: 5

Doug

Reputation: 35116

Use concat_ws, like this:

from pyspark.sql import SparkSession
from pyspark.sql.functions import concat, concat_ws, lit, when

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([["1", "2"], ["2", None], ["3", "4"], ["4", "5"], [None, "6"]]).toDF("a", "b")

# This won't work: concat returns NULL whenever any input is NULL
df = df.withColumn("concat", concat(df.a, df.b))

# This won't work either: the cast doesn't change the NULL propagation
df = df.withColumn("concat + cast", concat(df.a.cast('string'), df.b.cast('string')))

# Do it like this: concat_ws skips NULL inputs
df = df.withColumn("concat_ws", concat_ws("", df.a, df.b))
df.show()

gives:

+----+----+------+-------------+---------+
|   a|   b|concat|concat + cast|concat_ws|
+----+----+------+-------------+---------+
|   1|   2|    12|           12|       12|
|   2|null|  null|         null|        2|
|   3|   4|    34|           34|       34|
|   4|   5|    45|           45|       45|
|null|   6|  null|         null|        6|
+----+----+------+-------------+---------+

Note specifically that casting a NULL column to string doesn't work as you might wish: concat still returns NULL for the whole result whenever any of its inputs is NULL.
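Applied to the question's frame, a minimal sketch (again assuming the input_frame and sf alias from the question), where concat_ws simply skips the null 'test' values instead of propagating them:

import pyspark.sql.functions as sf

# concat_ws ignores NULL inputs, so each row's result is just 'test'
input_frame = input_frame.withColumn(
    'test',
    sf.concat_ws('', sf.col('test'), sf.lit('test'))
)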

There's no nice way of dealing with more complicated scenarios, but note that you can use a when expression inside a concat if you're willing to suffer the verbosity of it, like this:

df.withColumn("concat_custom", concat(
  when(df.a.isNull(), lit('_')).otherwise(df.a), 
  when(df.b.isNull(), lit('_')).otherwise(df.b))
)

To get, e.g.:

+----+----+-------------+
|   a|   b|concat_custom|
+----+----+-------------+
|   1|   2|           12|
|   2|null|           2_|
|   3|   4|           34|
|   4|   5|           45|
|null|   6|           _6|
+----+----+-------------+
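The same placeholder logic can also be written a bit more compactly with coalesce; a minimal sketch of an equivalent expression, assuming the same df and imports as above:

df.withColumn("concat_custom", concat(
    coalesce(df.a, lit('_')),  # '_' stands in for NULL in column a
    coalesce(df.b, lit('_'))   # '_' stands in for NULL in column b
))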

Upvotes: 20

ARCrow

Reputation: 1858

You can fill null values with empty strings:

import pyspark.sql.functions as f
from pyspark.sql.types import StringType

data = spark.createDataFrame([('s1', 't1'), ('s2', 't2')], ['col1', 'col2'])
# Add an all-null string column, then replace the nulls with '' before concatenating
data = data.withColumn('test', f.lit(None).cast(StringType()))
display(data.na.fill('').withColumn('test2', f.concat('col1', 'col2', 'test')))
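Note that display is Databricks-specific; with plain PySpark you can inspect the same result with .show(). A minimal sketch of the expected output for the data above:

data.na.fill('').withColumn('test2', f.concat('col1', 'col2', 'test')).show()

+----+----+----+-----+
|col1|col2|test|test2|
+----+----+----+-----+
|  s1|  t1|    | s1t1|
|  s2|  t2|    | s2t2|
+----+----+----+-----+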

Is that what you were looking for?

Upvotes: 0
