chocobooster

Reputation: 13

% difference over columns in PySpark for each row

I am trying to compute percentage difference over columns for each row in a dataframe. Here is my dataset:

Dataset sample

For example, for the first row, I am trying to get the variation rate of 2016 compared to 2015, of 2017 compared to 2016, and so on. Only 2015 and 2019 should be removed, so that there will be 5 columns at the end.

I know that window and lag can help achieve this, but I have been unsuccessful so far.

Upvotes: 0

Views: 727

Answers (1)

mck

Reputation: 42352

No window functions should be needed. You just need to calculate the % change by arithmetic operations on the columns, if I understood the question correctly.

import pyspark.sql.functions as F

# For each year, compute (current year - previous year) / previous year.
# The year columns are named '2015' ... '2019', so str(year) is used to
# reference them.
df2 = df.select(
    'city', 'postal_code',
    *[((F.col(str(year)) - F.col(str(year - 1))) / F.col(str(year - 1))).alias('percent_change_%s' % year)
      for year in [2016, 2017, 2018, 2019]]
)

Also, I don't understand why you want 5 columns at the end. Isn't it 6? Why is 2019 removed? You can calculate the % change for 2019 as (2019 - 2018) / 2018, for instance.
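As a quick sanity check of the formula itself, here is the same per-row arithmetic in plain Python (the yearly values below are made-up numbers for illustration, not taken from the question's dataset):

```python
# Hypothetical yearly values for a single row (made-up numbers).
values = {2015: 100.0, 2016: 110.0, 2017: 99.0, 2018: 132.0, 2019: 118.8}

# Percent change of each year relative to the previous one:
# (year - previous_year) / previous_year, the same formula applied
# column-wise in the PySpark select above.
percent_change = {
    year: (values[year] - values[year - 1]) / values[year - 1]
    for year in [2016, 2017, 2018, 2019]
}
print(percent_change)
```

Each value is a fraction, so multiply by 100 if you want percentages; for instance, 110 vs. 100 gives 0.1, i.e. a 10% increase.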

Upvotes: 1
