rclakmal

Reputation: 1982

Perform a groupBy on a dataframe while doing a computation in Apache Spark through PySpark

I'm playing with Apache Spark and have encountered the following situation. I have a Spark dataframe named 'data' which looks like this:

Name Grade Count
X     A     10
Y     A     8
X     B     2
Y     B     4

Now I want to groupBy this dataframe on 'Name' while calculating the ratio between the Grade A and Grade B counts. For example, for 'X' it would be 10/2 = 5.

Name Ratio
X     5
Y     2

PySpark version 1.3.0
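
For reference, the sample frame can be built like this (a minimal sketch; it assumes an existing SQLContext named sqlContext, the standard entry point in Spark 1.3, and uses lowercase column names to match the snippets below):

# Sketch of the sample data; assumes an existing SQLContext named sqlContext.
data = sqlContext.createDataFrame(
    [("X", "A", 10), ("Y", "A", 8), ("X", "B", 2), ("Y", "B", 4)],
    ["name", "grade", "count"])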

Upvotes: 0

Views: 444

Answers (1)

zero323

Reputation: 330423

Simple aggregation with when should work just fine:

from pyspark.sql.functions import col, sum, when

# Sum "count" only over rows with the matching grade; rows with any
# other grade contribute NULL, which sum ignores.
a_count = sum(when(col("grade") == "A", col("count")))
b_count = sum(when(col("grade") == "B", col("count")))

data.groupBy("name").agg((a_count / b_count).alias("ratio"))
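
Against the sample frame this should produce the expected ratios (a quick check; show() formatting is approximate and row order may vary):

data.groupBy("name").agg((a_count / b_count).alias("ratio")).show()

# name ratio
# X    5.0
# Y    2.0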

or with a join:

a = (data.where(col("grade") == "A")
  .groupBy("name")
  .agg(sum("count").alias("cnt"))
  .alias("a"))

b = (data.where(col("grade") == "B")
  .groupBy("name")
  .agg(sum("count").alias("cnt"))
  .alias("b"))

# Aggregate each grade separately, then join on name and divide the counts.
(a.join(b, col("a.name") == col("b.name"))
    .select(col("a.name"), (col("a.cnt") / col("b.cnt")).alias("ratio")))
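
One behavioral difference worth noting: the inner join drops any name that has rows for only one of the two grades, while the aggregation above keeps such a name with a NULL ratio. If those rows should survive the join variant too, a left outer join is one option (a sketch building on the frames above):

# Keep names that have Grade A rows but no Grade B rows; their ratio is NULL.
(a.join(b, col("a.name") == col("b.name"), "left_outer")
    .select(col("a.name"), (col("a.cnt") / col("b.cnt")).alias("ratio")))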

Upvotes: 4
