np.where logic in pyspark dataframe

Question

I'm looking for a way to take the characters after the 2nd position of a string in a dataframe column, but only if the string is longer than 2 characters, and place them in another column; otherwise the new column should be null. The Spark dataframe has several other columns.

I have a Spark dataframe that looks like this:

animal
======
mo
cat
mouse
snake
reptiles

I want something like this:

remainder
========
null
t
use
ake
ptiles
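
As a reference for the rule these two tables illustrate, here is a minimal plain-Python sketch. The helper name `remainder` is hypothetical and only serves to pin down the intended behaviour (drop the first two characters, or return nothing when the string has two characters or fewer).

```python
from typing import Optional


def remainder(animal: str) -> Optional[str]:
    """Return everything after the 2nd character, or None for short strings."""
    # Python slices are 0-based, so [2:] drops the first two characters.
    return animal[2:] if len(animal) > 2 else None


animals = ["mo", "cat", "mouse", "snake", "reptiles"]
remainders = [remainder(a) for a in animals]
```

Note that Python indexing is 0-based, whereas string positions in Spark SQL functions are 1-based, which matters when porting the slice.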

I can do this with np.where on a pandas dataframe, as below:

import numpy as np
df['remainder'] = np.where(df['animal'].str.len() > 2, df['animal'].str[2:], 'null')

How do I do the same in a pyspark dataframe?

Vaebhav · Accepted Answer

You can do this by combining when-otherwise with substring.

Data Preparation

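from io import StringIO

import pandas as pd
from pyspark.sql import SparkSession

# Assumption: `sql` refers to an active Spark session
sql = SparkSession.builder.getOrCreate()

s = StringIO("""
animal
mo
cat
mouse
snake
reptiles
""")

df = pd.read_csv(s, delimiter=',')

sparkDF = sql.createDataFrame(df)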

sparkDF.show()

+--------+
|  animal|
+--------+
|      mo|
|     cat|
|   mouse|
|   snake|
|reptiles|
+--------+

When-Otherwise - Substring

from pyspark.sql import functions as F

# substring is 1-based, so position 3 is the first character after the 2nd place
sparkDF = sparkDF.withColumn('animal_length', F.length(F.col('animal'))) \
            .withColumn('remainder', F.when(F.col('animal_length') > 2,
                                            F.substring(F.col('animal'), 3, 1000)
                                           ).otherwise(None)
                       ) \
            .drop('animal_length')

sparkDF.show()

+--------+---------+
|  animal|remainder|
+--------+---------+
|      mo|     null|
|     cat|        t|
|   mouse|      use|
|   snake|      ake|
|reptiles|   ptiles|
+--------+---------+
