chocobooster

Reputation: 13

% difference over columns in PySpark for each row

I am trying to compute percentage difference over columns for each row in a dataframe. Here is my dataset:

Dataset sample

For example, for the first row, I am trying to get the variation rate of 2016 compared to 2015, of 2017 compared to 2016, and so on. Only 2015 and 2019 should be removed, so that there will be 5 columns at the end.

I know that window and lag can help achieve this, but I have been unsuccessful so far.

Upvotes: 0

Views: 727

Answers (1)

mck

Reputation: 42352

No window functions should be needed. You just need to calculate the % change by arithmetic operations on the columns, if I understood the question correctly.

import pyspark.sql.functions as F

# For each year, compute (current year - previous year) / previous year.
# The year columns are named '2015' ... '2019', so str(year) is used to
# reference them.
df2 = df.select(
    'city', 'postal_code',
    *[((F.col(str(year)) - F.col(str(year - 1))) / F.col(str(year - 1))).alias('percent_change_%s' % year)
      for year in [2016, 2017, 2018, 2019]]
)

Also, I don't understand why you want 5 columns at the end. Isn't it 6? Why is 2019 removed? You can calculate the % change for 2019 as (2019 - 2018) / 2018, for instance.
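As a quick sanity check of the formula itself, here is the same per-row arithmetic in plain Python (the yearly values below are made-up numbers for illustration, not taken from the question's dataset):

```python
# Hypothetical yearly values for a single row (made-up numbers).
values = {2015: 100.0, 2016: 110.0, 2017: 99.0, 2018: 132.0, 2019: 118.8}

# Percent change of each year relative to the previous one:
# (year - previous_year) / previous_year, the same formula applied
# column-wise in the PySpark select above.
percent_change = {
    year: (values[year] - values[year - 1]) / values[year - 1]
    for year in [2016, 2017, 2018, 2019]
}
print(percent_change)
```

Each value is a fraction, so multiply by 100 if you want percentages; for instance, 110 vs. 100 gives 0.1, i.e. a 10% increase.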

Upvotes: 1
