Reputation: 1
I have a list of formulas (string format, that I loop using exec()) which I have to run through in order to calculate specific financial ratios (200 in total).
The formulas are written in PySpark syntax and can look like the following: F.when(F.col("a885").isNull(), F.lit(999)).otherwise((F.col("a888") + F.col("a889")) / F.col("a885")) or F.col("a225") * F.col("a111").
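For context, here is a minimal sketch of how I apply the formula strings (the sample data, column names and the formulas dict are just illustrative; the real dict has about 200 entries):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data standing in for the real table.
df = spark.createDataFrame(
    [(1.0, 2.0, None), (None, 3.0, 4.0)],
    ["a885", "a888", "a889"],
)

formulas = {
    "ratio_1": 'F.when(F.col("a885").isNull(), F.lit(999))'
               '.otherwise((F.col("a888") + F.col("a889")) / F.col("a885"))',
}

# Build one withColumn statement per ratio and exec() it.
for name, expr in formulas.items():
    exec(f'df = df.withColumn("{name}", {expr})')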
However, I need to apply certain conditions when it comes to NULL values.
Since there are 200 formulas, I am looking for a way to avoid manually updating each formula by adding conditions for NULL treatment (coalescing with 0, depending on the scenario, and so on). Is there a way to globally adjust the behavior of these operations, so that, for example, whenever I execute the string, the "+" operation automatically applies the NULL treatment described above?

Any idea is appreciated, whether it pertains to defining a new class for PySpark DataFrames, creating a special UDF that would automatically make the changes, or even slightly adjusting the PySpark configuration (although I don't believe much can be done from that side).
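One direction I had in mind (just a rough sketch; the SafeFunctions name, the coalesce-with-0 policy, and the use of eval() instead of exec() are my own illustrative choices) is to evaluate each formula string against a namespace where the name F points to a wrapper that intercepts col():

from pyspark.sql import functions as F

def null_safe_col(name):
    # One possible policy: treat NULL as 0 for arithmetic.
    return F.coalesce(F.col(name), F.lit(0))

class SafeFunctions:
    """Exposes the pyspark.sql.functions API, but with col() coalescing NULLs."""
    col = staticmethod(null_safe_col)

    def __getattr__(self, name):
        # Fall through to the real functions module for when(), lit(), etc.
        return getattr(F, name)

safe_F = SafeFunctions()

# Reusing the toy df and formulas dict from the sketch above; the formula
# strings stay untouched because "F" is rebound inside eval().
for name, expr in formulas.items():
    df = df.withColumn(name, eval(expr, {"F": safe_F}))

I realise this would also change the result of the isNull() checks inside some formulas (a coalesced column is never NULL), so the policy would have to be chosen carefully.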
I already tried to define a new class and implement the NULL treatment in its methods. I tried something like:
class Number:
    """Wraps a plain Python value; '+' treats None as a neutral element."""

    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        if isinstance(other, Number):
            # If either side is None, return the other operand's value.
            if self.value is None:
                return other.value
            elif other.value is None:
                return self.value
            else:
                return self.value + other.value
        elif other is None:
            return self.value
        else:
            return self.value + other
However, I am not sure whether this works with PySpark DataFrames, or how to apply the class to them. I also don't know whether this approach would perform well when calculating 200 new columns over a large number of rows.
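The closest Spark-side version of this idea I could think of (again only a sketch; SafeCol is a name I made up, and only "+" is handled here) wraps the Column object so that the operator builds a null-safe expression instead of computing Python values:

from pyspark.sql import Column, functions as F

class SafeCol:
    """Wraps a pyspark Column; '+' treats a NULL operand as the other operand."""

    def __init__(self, col):
        self.col = col if isinstance(col, Column) else F.lit(col)

    def __add__(self, other):
        other_col = other.col if isinstance(other, SafeCol) else F.lit(other)
        expr = (
            F.when(self.col.isNull(), other_col)
             .when(other_col.isNull(), self.col)
             .otherwise(self.col + other_col)
        )
        return SafeCol(expr)

    __radd__ = __add__

# Usage: wrap columns before combining them; .col recovers the plain Column.
# df = df.withColumn("a888_plus_a889", (SafeCol(F.col("a888")) + SafeCol(F.col("a889"))).col)

Because this only builds Column expressions, the work should stay on the Spark side rather than run row by row in Python, but I am not sure it is the right way to combine with the formula strings.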
Upvotes: 0
Views: 40