Reputation: 13
I am trying to compute the percentage difference across columns for each row in a dataframe. Here is my dataset:
For example, for the first row, I am trying to get the variation rate of 2016 compared to 2015, of 2017 compared to 2016, and so on. Only 2015 and 2019 should be removed, so that there will be 5 columns at the end.
I know that window functions and lag could help achieve this, but I have been unsuccessful so far.
Upvotes: 0
Views: 727
Reputation: 42352
No window functions should be needed. If I understood the question correctly, you just need to calculate the % change with arithmetic operations on the columns.
import pyspark.sql.functions as F

df2 = df.select(
    'city', 'postal_code',
    # (current year - previous year) / previous year, one column per year
    *[((F.col(str(year)) - F.col(str(year - 1))) / F.col(str(year - 1))).alias('percent_change_%s' % year)
      for year in [2016, 2017, 2018, 2019]]
)
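The arithmetic is the same as a plain Python computation on one row, so you can sanity-check it without Spark. A minimal sketch, using made-up values for a single hypothetical row:

```python
# One hypothetical row of the dataframe: year -> value.
# These numbers are made up purely for illustration.
row = {2015: 100.0, 2016: 110.0, 2017: 99.0, 2018: 132.0, 2019: 120.0}

# Same formula as the Spark expression above:
# (current year - previous year) / previous year
percent_change = {
    year: (row[year] - row[year - 1]) / row[year - 1]
    for year in [2016, 2017, 2018, 2019]
}

print(percent_change)
# e.g. percent_change[2016] == 0.1 (a 10% increase from 2015 to 2016)
```

Each Spark column built in the `select` above applies exactly this per-row formula in parallel over the whole dataframe.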
Also, I don't understand why you want 5 columns at the end. Shouldn't it be 6? Why is 2019 removed? You can calculate the % change for 2019 as (2019 - 2018) / 2018, for instance.
Upvotes: 1