Reputation: 221
I have the following PySpark DataFrame
where each column represents a time series and I'd like to study their distance to the mean.
+----+----+-----+---------+
| T1 | T2 | ... | Average |
+----+----+-----+---------+
| 1 | 2 | ... | 2 |
| -1 | 5 | ... | 4 |
+----+----+-----+---------+
This is what I'm hoping to get:
+----+----+-----+---------+
| T1 | T2 | ... | Average |
+----+----+-----+---------+
| -1 | 0 | ... | 2 |
| -5 | 1 | ... | 4 |
+----+----+-----+---------+
Up until now, I've tried naively running a UDF on individual columns but it takes respectively 30s-50s-80s... (keeps increasing) per column so I'm probably doing something wrong.
cols = ["T1", "T2", ...]
for c in cols:
df = df.withColumn(c, df[c] - df["Average"])
Is there a better way to do this transformation of adding one column to many other?
Upvotes: 0
Views: 27
Reputation: 13591
By using rdd, it can be done in this way.
+---+---+-------+
|T1 |T2 |Average|
+---+---+-------+
|1 |2 |2 |
|-1 |5 |4 |
+---+---+-------+
df.rdd.map(lambda r: (*[r[i] - r[-1] for i in range(0, len(r) - 1)], r[-1])) \
.toDF(df.columns).show()
+---+---+-------+
| T1| T2|Average|
+---+---+-------+
| -1| 0| 2|
| -5| 1| 4|
+---+---+-------+
Upvotes: 2