How can I add one column to other columns in PySpark?

Question

I have the following PySpark DataFrame where each column represents a time series and I'd like to study their distance to the mean.

+----+----+-----+---------+
| T1 | T2 | ... | Average |
+----+----+-----+---------+
| 1  | 2  | ... |  2      |
| -1 | 5  | ... |  4      |
+----+----+-----+---------+

This is what I'm hoping to get:

+----+----+-----+---------+
| T1 | T2 | ... | Average |
+----+----+-----+---------+
| -1 | 0  | ... |  2      |
| -5 | 1  | ... |  4      |
+----+----+-----+---------+

Up until now, I've tried naively processing the columns one at a time, but each successive column takes longer than the last (roughly 30s, then 50s, then 80s, and it keeps increasing), so I'm probably doing something wrong.

cols = ["T1", "T2", ...]
for c in cols:
    df = df.withColumn(c, df[c] - df["Average"])

Is there a better way to apply this transformation, i.e. subtracting one column from many others?

Lamanus · Accepted Answer

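One way is to drop down to the RDD API and rebuild each row, subtracting the last field (Average) from every other field. Starting from this DataFrame: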

+---+---+-------+
|T1 |T2 |Average|
+---+---+-------+
|1  |2  |2      |
|-1 |5  |4      |
+---+---+-------+

# Subtract the last field (Average) from every other field; keep Average itself unchanged.
df.rdd.map(lambda r: (*[r[i] - r[-1] for i in range(len(r) - 1)], r[-1])) \
  .toDF(df.columns).show()

+---+---+-------+
| T1| T2|Average|
+---+---+-------+
| -1|  0|      2|
| -5|  1|      4|
+---+---+-------+
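
For what it's worth, the slowdown in the question's loop is probably because each withColumn call adds another projection to the query plan, so the plan keeps growing with every column. Below is a rough sketch of an alternative that stays in the DataFrame API: build all the expressions up front and apply them in a single selectExpr. The helper name subtract_exprs and its ref parameter are made up for illustration; DataFrame.selectExpr is the only PySpark call involved, and df is assumed to be the DataFrame from the question.

```python
def subtract_exprs(columns, ref="Average"):
    """Build one Spark SQL expression per column, for DataFrame.selectExpr.

    Every column except `ref` becomes "<col> - <ref>", aliased back to its
    original name; `ref` itself is passed through unchanged.
    Backticks guard column names that contain spaces or dots.
    """
    exprs = []
    for c in columns:
        if c == ref:
            exprs.append(f"`{c}`")
        else:
            exprs.append(f"`{c}` - `{ref}` AS `{c}`")
    return exprs


# Column names taken from the example DataFrame above.
exprs = subtract_exprs(["T1", "T2", "Average"])
print(exprs)

# With the real DataFrame `df` (assumed to exist), a single projection
# would replace the whole loop:
# df = df.selectExpr(*subtract_exprs(df.columns))
```

Since everything happens in one projection, this should avoid the growing plan, and it also skips the round trip through Python rows that the RDD map needs.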
