Reputation: 1982
I'm playing with Apache Spark and have encountered the following situation. I have a Spark DataFrame named 'data' that looks like this:
Name  Grade  Count
X     A      10
Y     A      8
X     B      2
Y     B      4
Now I want to group this DataFrame by 'Name' while computing the ratio between the Grade A count and the Grade B count. For example, for 'X' it would be 10/2 = 5:
Name  Ratio
X     5
Y     2
PySpark version: 1.3.0
Upvotes: 0
Views: 444
Reputation: 330423
A simple aggregation with when should work just fine:
from pyspark.sql.functions import col, sum, when

# Rows that don't match the condition become NULL, and sum ignores NULLs,
# so each expression sums only the counts for that grade.
a_count = sum(when(col("grade") == "A", col("count")))
b_count = sum(when(col("grade") == "B", col("count")))

data.groupBy("name").agg((a_count / b_count).alias("ratio"))
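The conditional-sum-then-divide logic can be sanity-checked without Spark; a minimal sketch in plain Python, using the sample data from the question:

```python
# Sample rows from the question: (name, grade, count).
rows = [
    ("X", "A", 10),
    ("Y", "A", 8),
    ("X", "B", 2),
    ("Y", "B", 4),
]

# Mirror sum(when(...)): sum counts per name, separately for each grade.
a_count, b_count = {}, {}
for name, grade, count in rows:
    if grade == "A":
        a_count[name] = a_count.get(name, 0) + count
    elif grade == "B":
        b_count[name] = b_count.get(name, 0) + count

ratio = {name: a_count[name] / b_count[name] for name in a_count}
print(ratio)  # {'X': 5.0, 'Y': 2.0}
```

Note that Y's ratio comes out as 8/4 = 2, which is what the Spark aggregation above will also produce.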
or with join:
a = (data.where(col("grade") == "A")
     .groupby("name")
     .agg(sum("count").alias("cnt"))
     .alias("a"))

b = (data.where(col("grade") == "B")
     .groupby("name")
     .agg(sum("count").alias("cnt"))
     .alias("b"))

(a.join(b, col("a.name") == col("b.name"))
 .select(col("a.name"), (col("a.cnt") / col("b.cnt")).alias("ratio")))
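The join variant can likewise be sketched in plain Python, under the assumption that the inner join keeps only names present in both per-grade aggregates:

```python
# Sample rows from the question: (name, grade, count).
rows = [("X", "A", 10), ("Y", "A", 8), ("X", "B", 2), ("Y", "B", 4)]

def agg_by_grade(grade):
    """Aggregate counts per name for one grade (like where + groupby + sum)."""
    out = {}
    for name, g, count in rows:
        if g == grade:
            out[name] = out.get(name, 0) + count
    return out

a = agg_by_grade("A")
b = agg_by_grade("B")

# Inner join on name, then divide the two aggregated counts.
joined_ratio = {name: a[name] / b[name] for name in a if name in b}
print(joined_ratio)  # {'X': 5.0, 'Y': 2.0}
```

Unlike the single-aggregation version, a name that has no Grade B rows is simply dropped by the inner join rather than producing a NULL ratio.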
Upvotes: 4