Reputation: 1334
I am using PySpark 3.0.1.
I want to modify a column when a condition is met; otherwise I want to keep the value the column already had.
df.printSchema()
root
|-- ID: decimal(4,0) (nullable = true)
|-- Provider: string (nullable = true)
|-- Principal: float (nullable = false)
|-- PRINCIPALBALANCE: float (nullable = true)
|-- STATUS: integer (nullable = true)
|-- Installment Rate: float (nullable = true)
|-- Yearly Percentage: float (nullable = true)
|-- Processing Fee Percentage: double (nullable = true)
|-- Disb Date: string (nullable = true)
|-- ZOHOID: integer (nullable = true)
|-- UPFRONTPROCESSINGFEEBALANCE: float (nullable = true)
|-- WITHHOLDINGTAXBALANCE: float (nullable = true)
|-- UPFRONTPROCESSINGFEEPERCENTAGE: float (nullable = true)
|-- UPFRONTPROCESSINGFEEWHTPERCENTAGE: float (nullable = true)
|-- PROCESSINGFEEWHTPERCENTAGE: float (nullable = true)
|-- PROCESSINGFEEVATPERCENTAGE: float (nullable = true)
|-- BUSINESSSHORTCODE: string (nullable = true)
|-- EXCTRACTIONDATE: timestamp (nullable = true)
|-- fake Fee: double (nullable = false)
|-- fake WHT: string (nullable = true)
|-- fake Fee_WHT: string (nullable = true)
|-- Agency Fee CP: string (nullable = true)
|-- Agency VAT CP: string (nullable = true)
|-- Agency WHT CP: string (nullable = true)
|-- Agency Fee_VAT_WHT CP: string (nullable = true)
df.head(1)
[Row(ID=Decimal('16'), Provider='fake', Principal=2000.01, PRINCIPALBALANCE=0.2, STATUS=4, Installment Rate=0.33333333, Yearly Percentage=600.0, Processing Fee Percentage=0.20, Disb Date=None, ZOHOID=3000, UPFRONTPROCESSINGFEEBALANCE=None, WITHHOLDINGTAXBALANCE=None, UPFRONTPROCESSINGFEEPERCENTAGE=None, UPFRONTPROCESSINGFEEWHTPERCENTAGE=None, PROCESSINGFEEWHTPERCENTAGE=None, PROCESSINGFEEVATPERCENTAGE=16.0, BUSINESSSHORTCODE='20005', EXCTRACTIONDATE=datetime.datetime(2020, 11, 25, 5, 7, 58, 6000), fake Fee=1770.7, fake WHT='312.48', fake Fee_WHT='2,083.18', Agency Fee CP='566.62', Agency VAT CP='566.62', Agency WHT CP='186.39', Agency Fee_VAT_WHT CP='5,394.41')]
I have read that I can do this with when and otherwise, but I get the following error when I run this code:
from pyspark.sql.functions import when
df.withColumn('Gross Loan Amount',when(((df['Disb Date'] <='2018-03-19') &(df['ID']!=457))
,(df['Principal']+df['Agency Fee CP']+df['Agency VAT CP']).otherwise(df['Gross Loan Amount'])))
---------------------------------------------------------------------------
AnalysisException Traceback (most recent call last)
<ipython-input-18-abd559e65640> in <module>
1 from pyspark.sql.functions import when
2 df.withColumn('Gross Loan Amount',when(((df['Disb Date'] <='2018-03-19') &(df['ID']!=457))
----> 3 ,(df['Principal']+df['Agency Fee CP']+df['Agency VAT CP']).otherwise(df['Gross Loan Amount'])))
/usr/local/spark/python/pyspark/sql/dataframe.py in __getitem__(self, item)
1378 """
1379 if isinstance(item, basestring):
-> 1380 jc = self._jdf.apply(item)
1381 return Column(jc)
1382 elif isinstance(item, Column):
/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py in __call__(self, *args)
1303 answer = self.gateway_client.send_command(command)
1304 return_value = get_return_value(
-> 1305 answer, self.gateway_client, self.target_id, self.name)
1306
1307 for temp_arg in temp_args:
/usr/local/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
132 # Hide where the exception came from that shows a non-Pythonic
133 # JVM exception message.
--> 134 raise_from(converted)
135 else:
136 raise
/usr/local/spark/python/pyspark/sql/utils.py in raise_from(e)
AnalysisException: Cannot resolve column name "Gross Loan Amount" among (ID, Provider, Principal, PRINCIPALBALANCE, STATUS, Installment Rate, Yearly Percentage, Processing Fee Percentage, Disb Date, ZOHOID, UPFRONTPROCESSINGFEEBALANCE, WITHHOLDINGTAXBALANCE, UPFRONTPROCESSINGFEEPERCENTAGE, UPFRONTPROCESSINGFEEWHTPERCENTAGE, PROCESSINGFEEWHTPERCENTAGE, PROCESSINGFEEVATPERCENTAGE, BUSINESSSHORTCODE, EXCTRACTIONDATE, fake Fee, fake WHT, fake Fee_WHT, Agency Fee CP, Agency VAT CP, Agency WHT CP, Agency Fee_VAT_WHT CP);
I created a fake dataframe to test that when and otherwise really work:
df_test = spark.createDataFrame(
[
(1, 'foo','a'), # create your data here, be consistent in the types.
(2, 'bar','b'),
(2, 'fee','c'),
],
['id', 'txt','letter'] # add your columns label here
)
df_test.show()
+---+---+------+
| id|txt|letter|
+---+---+------+
| 1|foo| a|
| 2|bar| b|
| 2|fee| c|
+---+---+------+
df_test.withColumn('txt',when(df_test['id']==1,'change').otherwise(df_test['txt'])).show()
+---+------+------+
| id| txt|letter|
+---+------+------+
| 1|change| a|
| 2| bar| b|
| 2| fee| c|
+---+------+------+
What am I doing wrong, or what am I not considering?
Upvotes: 0
Views: 2082
Reputation: 42392
Gross Loan Amount is not a column in the dataframe, so the reference to it inside otherwise cannot be resolved in your first example. In your second example, txt is an existing column, so the otherwise reference can be resolved.
Are you sure you want to modify the Gross Loan Amount column, which does not exist?
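If you do want to create it on the fly, the fallback in otherwise has to be something that already exists (or a literal). A minimal sketch, untested against your data; the cast('double') calls are an assumption based on your schema, where the Agency columns are strings, and the null fallback is just a placeholder default:
from pyspark.sql import functions as F

# .otherwise(...) must be chained onto when(...), not onto the value
# expression as in your snippet, and it must not reference the
# not-yet-created 'Gross Loan Amount' column.
df = df.withColumn(
    'Gross Loan Amount',
    F.when(
        (F.col('Disb Date') <= '2018-03-19') & (F.col('ID') != 457),
        F.col('Principal')
        + F.col('Agency Fee CP').cast('double')  # string columns in your schema,
        + F.col('Agency VAT CP').cast('double')  # so cast before adding
    ).otherwise(F.lit(None).cast('double'))      # pick whatever default you want
)
Alternatively, create the column first (e.g. df = df.withColumn('Gross Loan Amount', F.lit(0.0))) and your original otherwise(df['Gross Loan Amount']) will then resolve. Also note that if the strings carry thousands separators, as in '2,083.18', the cast will return null, so strip the commas first (e.g. with regexp_replace).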
Upvotes: 1