Is there a way to “globally” change how PySpark handles NULL values in arithmetic operations (e.g. addition, subtraction, multiplication)?

Question

I have a list of 200 formulas, stored as strings, that I loop over and run with exec() to calculate specific financial ratios.

The formulas are written in PySpark syntax and look like this: 'F.when(F.col("a885").isNull(), F.lit(999)).otherwise((F.col("a888") + F.col("a889")) / F.col("a885"))' or 'F.col("a225") * F.col("a111")'.

However, I need to apply certain conditions when it comes to NULL values.

Specifically,

addition/subtraction:

if one term is NULL, it will be treated as 0
if all terms are NULL, then result is NULL.
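
As a reference point for the addition/subtraction rules above, here is a minimal plain-Python sketch that models a NULL as None. The helper names null_add and null_sub are made up for illustration and are not part of PySpark. On Spark columns, the "treat as 0" part would typically be expressed with F.coalesce(col, F.lit(0)), plus an explicit check for the all-NULL case.

```python
def null_add(*terms):
    """Sum the terms, treating None as 0, unless every term is None."""
    present = [t for t in terms if t is not None]
    if not present:
        # All terms are NULL, so the result is NULL.
        return None
    return sum(present)


def null_sub(left, right):
    """Subtract right from left under the same NULL rules."""
    if left is None and right is None:
        return None
    # A single missing term counts as 0.
    return (0 if left is None else left) - (0 if right is None else right)
```

A small model like this can serve as a truth table for unit-testing whichever Spark-side implementation is eventually chosen.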

multiplication (term1 * term2):

if term1 is NULL, the result is NULL
if term2 is NULL, the result is 0
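
The multiplication rule is asymmetric, which is easy to get wrong, so here is a minimal plain-Python sketch of it with NULL modelled as None. The name null_mul is hypothetical. The rules above do not say what happens when both terms are NULL; this sketch assumes the term1 rule wins and the result is NULL.

```python
def null_mul(term1, term2):
    """Multiply two terms under the asymmetric NULL rules."""
    if term1 is None:
        # A NULL left-hand term makes the whole product NULL
        # (assumed to apply even when term2 is NULL as well).
        return None
    if term2 is None:
        # A NULL right-hand term makes the product 0.
        return 0
    return term1 * term2
```

Because of the asymmetry, null_mul(a, b) and null_mul(b, a) can differ when exactly one of the terms is NULL, so the operand order in each formula matters.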

Since there are 200 formulas, I am looking for a way to avoid manually updating each one with NULL-handling conditions (coalesce with 0, depending on the scenario, and so on). Is there a way to globally adjust the behavior of these operators, so that, for example, whenever I execute a formula string, the “+” operator follows the rules above? Any idea is appreciated, whether it involves defining a new class for PySpark tables, creating a special UDF that applies the changes automatically, or even adjusting the PySpark configuration (although I don’t believe much can be done from that angle).
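
One possible way to apply such rules to every formula string without editing the strings by hand is to rewrite each one just before it is executed. The sketch below uses only the standard-library ast module (ast.unparse needs Python 3.9 or later) to replace the +, - and * operators with calls to helper functions. The helper names null_add, null_sub and null_mul are assumptions: they are not PySpark functions and would have to be defined separately, for example as column expressions built from F.when and F.coalesce.

```python
import ast


class NullAwareRewriter(ast.NodeTransformer):
    """Replace +, - and * with calls to (hypothetical) NULL-aware helpers."""

    _HELPERS = {ast.Add: "null_add", ast.Sub: "null_sub", ast.Mult: "null_mul"}

    def visit_BinOp(self, node):
        # Rewrite nested operands first so chained operators are handled.
        self.generic_visit(node)
        helper = self._HELPERS.get(type(node.op))
        if helper is None:
            # Other operators (e.g. division) are left untouched.
            return node
        call = ast.Call(
            func=ast.Name(id=helper, ctx=ast.Load()),
            args=[node.left, node.right],
            keywords=[],
        )
        return ast.copy_location(call, node)


def rewrite_formula(formula: str) -> str:
    """Return the formula source with arithmetic routed through the helpers."""
    tree = ast.parse(formula, mode="eval")
    tree = NullAwareRewriter().visit(tree)
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)
```

With this in place, a formula such as 'F.col("a225") * F.col("a111")' becomes a call to null_mul on the two column expressions, while parts like F.when(...) and .isNull() pass through unchanged. The helpers must be present in the namespace in which the rewritten string is executed.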

I already tried defining a new class and putting the NULL treatment in its methods. I tried something like:

class Number:
    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        if isinstance(other, Number):
            if self.value is None:
                return other.value
            elif other.value is None:
                return self.value
            else:
                return self.value + other.value
        elif other is None:
            return self.value
        else:
            return self.value + other

However, I am not sure whether this works with PySpark tables or how to apply the class to them. I also don’t know whether this approach would perform well when calculating 200 new columns over a large number of rows.
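
For comparison, here is a hedged plain-Python sketch of how an operator-overloading wrapper could be made to chain: every operator returns a new wrapper instead of a raw value, and the reflected method handles a plain number on the left. The class name NullAware is hypothetical. Note that checks like `is None` only see Python-level values. In a Spark column, NULLs exist per row, so a column-based version would have to build the same branches out of column expressions (for example with F.when and F.coalesce) rather than Python if statements.

```python
class NullAware:
    """Wrap a value that may be None and apply the NULL rules via operators."""

    def __init__(self, value):
        self.value = value

    @staticmethod
    def _unwrap(other):
        # Accept both wrapped values and plain numbers / None.
        return other.value if isinstance(other, NullAware) else other

    def __add__(self, other):
        a, b = self.value, self._unwrap(other)
        if a is None and b is None:
            return NullAware(None)  # all terms NULL -> NULL
        return NullAware((0 if a is None else a) + (0 if b is None else b))

    # The addition rule is symmetric, so the reflected form is identical.
    __radd__ = __add__

    def __mul__(self, other):
        a, b = self.value, self._unwrap(other)
        if a is None:
            return NullAware(None)  # NULL left-hand term -> NULL
        if b is None:
            return NullAware(0)  # NULL right-hand term -> 0
        return NullAware(a * b)
```

Returning a wrapper from each operator is what lets expressions such as NullAware(1) + 2 + NullAware(None) compose without falling back to ordinary arithmetic halfway through.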

Answers (0)
