Ray88

Reputation: 1

Is there a way to change “globally” the way PySpark deals with NULL values in different operations (e.g. addition, subtraction, multiplication)?

I have a list of formulas (stored as strings, which I loop over and run with exec()) that I need to evaluate in order to calculate specific financial ratios (200 in total).

The formulas are written in PySpark syntax and can look as follows: 'F.when(F.col("a885").isNull(), F.lit(999)).otherwise((F.col("a888") + F.col("a889")) / F.col("a885"))' or 'F.col("a225") * F.col("a111")'.

However, I need to apply certain conditions when it comes to NULL values.

Specifically,

  1. addition/subtraction:
  2. multiplication: term1 * term2

Since there are 200 formulas, I am looking for a way to avoid manually updating each one by adding conditions for NULL treatment (coalesce with 0, depending on each scenario, and so on). Is there a way to globally adjust the behavior of these operations, so that, for example, whenever I execute the string, the “+” operation applies the NULL handling described above?

Any idea is appreciated, whether it involves defining a new class for PySpark tables, creating a special UDF that would make the changes automatically, or even slightly adjusting the PySpark configuration (although I don’t believe I can do much from that angle).
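One direction that avoids editing the formulas by hand is to rewrite each formula string before it is executed. The following is only a minimal sketch, assuming Python 3.9+ (for `ast.unparse`) and assuming that each formula is a single Python expression. The helper names `null_add`, `null_sub` and `null_mul` are hypothetical: they are not part of PySpark and would have to be defined separately to implement whatever NULL rules are required (for instance with `F.coalesce`).

```python
import ast

# Hypothetical helper names. They are NOT PySpark functions; they are assumed
# to be defined elsewhere and to implement the required NULL rules.
OPERATOR_HELPERS = {
    ast.Add: "null_add",
    ast.Sub: "null_sub",
    ast.Mult: "null_mul",
}


class NullSafeRewriter(ast.NodeTransformer):
    """Replace `a + b`, `a - b` and `a * b` with calls to helper functions."""

    def visit_BinOp(self, node):
        # Rewrite the operands first so nested operations are wrapped too.
        self.generic_visit(node)
        helper = OPERATOR_HELPERS.get(type(node.op))
        if helper is None:
            return node  # other operators (e.g. division) are left unchanged
        call = ast.Call(
            func=ast.Name(id=helper, ctx=ast.Load()),
            args=[node.left, node.right],
            keywords=[],
        )
        return ast.copy_location(call, node)


def rewrite_formula(formula: str) -> str:
    """Return the formula with +, - and * replaced by helper calls."""
    tree = ast.parse(formula, mode="eval")
    tree = NullSafeRewriter().visit(tree)
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+


if __name__ == "__main__":
    print(rewrite_formula('F.col("a225") * F.col("a111")'))
    print(rewrite_formula('(F.col("a888") + F.col("a889")) / F.col("a885")'))
```

The rewritten string could then be evaluated in a namespace that contains `F` and the helpers. Because the helpers would build ordinary Column expressions, the NULL handling would stay inside Spark's own expression engine rather than in a row-by-row Python UDF.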

I already tried to define a new class and put the NULL treatment in its methods. I tried something like:

class Number:
    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        if isinstance(other, Number):
            if self.value is None:
                return other.value
            elif other.value is None:
                return self.value
            else:
                return self.value + other.value
        elif other is None:
            return self.value
        else:
            return self.value + other

However, I am not sure whether this works with PySpark tables or how to apply this class to them. I also don’t know whether this approach would perform well when calculating 200 new columns over a large number of rows.

Upvotes: 0

Views: 40

Answers (0)
