Reputation: 91
I am new to Spark and this might be a straightforward problem.
I have a PySpark DataFrame DF1 with a column named "A" and some other columns. The other column names can vary from case to case. I want to generate a DataFrame DF2 by dividing the values of these "other" columns by A. The columns of DF2 would be just the "other" columns.
For example for
DF1 = sql_sc.createDataFrame([(1, 2, 3), (2, 4, 6), (3, 6, 9), (4, 8, 12), (5, 10, 15)], ["A", "B", "C"])
The result would be a table as:
DF2 = sql_sc.createDataFrame([(2, 3), (2, 3), (2, 3), (2, 3), (2, 3)], ["B", "C"])
How can I do this task in PySpark SQL?
Thanks
Upvotes: 1
Views: 4410
Reputation: 8513
You can use DataFrame.columns to iterate over the columns and then use the division operator with withColumn to perform the row-wise math. Like this:
from pyspark.sql.functions import col

DF1 = spark.createDataFrame([(1, 2, 3), (2, 4, 6), (3, 6, 9), (4, 8, 12), (5, 10, 15)], ["A", "B", "C"])
DF1.show()

df = DF1
for field in DF1.columns:
    if field != 'A':
        # Overwrite each non-"A" column with its value divided by "A"
        df = df.withColumn(field, col(field) / col("A"))

# Drop "A" so only the divided columns remain
DF2 = df.drop('A')
DF2.show()
Which outputs this:
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 2| 3|
| 2| 4| 6|
| 3| 6| 9|
| 4| 8| 12|
| 5| 10| 15|
+---+---+---+
+---+---+
| B| C|
+---+---+
|2.0|3.0|
|2.0|3.0|
|2.0|3.0|
|2.0|3.0|
|2.0|3.0|
+---+---+
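If you want to avoid the loop, a single select with a list comprehension gives the same result in one projection. This is a minimal sketch of that alternative, assuming the same DF1 as above:

from pyspark.sql.functions import col

# Divide every non-"A" column by "A" in one select,
# keeping the original column names via alias()
DF2 = DF1.select([(col(f) / col("A")).alias(f) for f in DF1.columns if f != "A"])
DF2.show()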
Upvotes: 2