Reputation: 99
Spark with Scala
My Product DataFrame looks like:
+----------------+--------+
|Product         |countnum|
+----------------+--------+
|Resistor        |       2|
|Capacitor       |       2|
|Inductor        |       3|
+----------------+--------+
I am trying to get the overall percentage, like below:
+----------------+--------+--------------+
|Product         |countnum|Overall Profit|
+----------------+--------+--------------+
|Resistor        |       2|28%           |  <- 2/7*100 (7 is the total countnum)
|Capacitor       |       2|28%           |  <- 2/7*100
|Inductor        |       3|42.85%        |  <- 3/7*100
+----------------+--------+--------------+
Code:
val df = newdf.select("countnum"/(sum("countnum") as("percentage") ,"product","countnum",)
but it is giving the error "cannot resolve overloaded method 'select'".
Am I missing something here?
Upvotes: 2
Views: 162
Reputation: 31490
Use a window function to get sum(countnum) and then divide by it.
Example:
Approach 1: Using window functions
df.show()
//+---------+--------+
//| product|countnum|
//+---------+--------+
//| Resistor| 2|
//|capacitor| 2|
//| inductor| 3|
//+---------+--------+
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
//whole-frame window, so sum(countnum) over it is the grand total
val w=Window.orderBy("countnum").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn("overallprofit",col("countnum")/sum(col("countnum")).over(w)*100).show()
//+---------+--------+------------------+
//| product|countnum| overallprofit|
//+---------+--------+------------------+
//| Resistor| 2| 28.57142857142857|
//|capacitor| 2| 28.57142857142857|
//| inductor| 3|42.857142857142854|
//+---------+--------+------------------+
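A small variant (my sketch, not from the original answer): the orderBy isn't actually needed for a whole-frame total, so the window can be defined with just the frame bounds (Spark will still warn that no partition is defined and move all rows to a single partition).
//unordered whole-frame window: every row sees the grand total of countnum
val wAll=Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn("overallprofit",col("countnum")/sum(col("countnum")).over(wAll)*100).show()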
Approach 2: Without using window functions
//grand total computed as a one-row DataFrame, cross joined onto every row below
val df1=df.select(sum("countnum").alias("overall_sum"))
df.crossJoin(df1).withColumn("overallprofit",col("countnum")/col("overall_sum")*100).
drop("overall_sum").
show()
//+---------+--------+------------------+
//| product|countnum| overallprofit|
//+---------+--------+------------------+
//| Resistor| 2| 28.57142857142857|
//|capacitor| 2| 28.57142857142857|
//| inductor| 3|42.857142857142854|
//+---------+--------+------------------+
//rounding to two decimal places and appending "%"
df.crossJoin(df1).
withColumn("overallprofit",concat(round(col("countnum")/col("overall_sum")*100,2),lit("%"))).
drop("overall_sum").
show()
//+---------+--------+-------------+
//| product|countnum|overallprofit|
//+---------+--------+-------------+
//| Resistor| 2| 28.57%|
//|capacitor| 2| 28.57%|
//| inductor| 3| 42.86%|
//+---------+--------+-------------+
//cast to decimal with one decimal place of precision
df.crossJoin(df1).
withColumn("overallprofit",concat((col("countnum")/col("overall_sum")*100).cast("decimal(10,1)"),lit("%"))).
drop("overall_sum").
show()
//+---------+--------+-------------+
//| product|countnum|overallprofit|
//+---------+--------+-------------+
//| Resistor| 2| 28.6%|
//|capacitor| 2| 28.6%|
//| inductor| 3| 42.9%|
//+---------+--------+-------------+
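Formatting aside (my addition, not part of the original answer): format_string can do the rounding and add the percent sign in one call, assuming the same df and df1 as above.
df.crossJoin(df1).
withColumn("overallprofit",format_string("%.2f%%",col("countnum")/col("overall_sum")*100)).
drop("overall_sum").
show()
//overallprofit comes out as 28.57%, 28.57%, 42.86%, matching the rounded example above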
Upvotes: 1
Reputation: 2072
The code below may help you.
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq(("Resistor", 2), ("Capacitor",2), ("Inductor", 3))).toDF("Product", "countnum")
df.show()
/*
+---------+--------+
| Product|countnum|
+---------+--------+
| Resistor| 2|
|Capacitor| 2|
| Inductor| 3|
+---------+--------+
*/
df.groupBy("Product").agg(sum("countnum").alias("sum")).withColumn("fraction", round(((col("sum") / sum("sum").over()) *100),2)).show()
/*
+---------+---+--------+
| Product|sum|fraction|
+---------+---+--------+
| Inductor| 3| 42.86|
| Resistor| 2| 28.57|
|Capacitor| 2| 28.57|
+---------+---+--------+
*/
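If the "%"-suffixed text from the question's expected output is wanted, the same expression can be wrapped in concat (an illustrative tweak on my part, not in the original answer):
df.groupBy("Product").agg(sum("countnum").alias("sum")).
withColumn("fraction", concat(round((col("sum") / sum("sum").over()) * 100, 2), lit("%"))).
show()
//fraction now shows 42.86%, 28.57%, 28.57%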
Upvotes: 1