Reputation: 525
I have two dataframes.
One is coming from groupBy and the other is the total summary:
a = data.groupBy("bucket").agg(sum(a.total))
b = data.agg(sum(a.total))
I want to put the total from b to a dataframe so that I can calculate the % on each bucket.
Do you know what kind of join I shall use?
Upvotes: 1
Views: 110
Reputation: 83
You can use also collect() as you will return to the driver just a simple result
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark = SparkSession.builder.getOrCreate()
df = spark.sql("select 'A' as bucket, 5 as value union all select 'B' as bucket, 8 as value")
df_total = spark.sql("select 9 as total")
df=df.withColumn('total',lit(df_total.collect()[0]['total']))
+------+-----+-----+
|bucket|value|total|
+------+-----+-----+
| A| 5| 9|
| B| 8| 9|
+------+-----+-----+
df= df.withColumn('pourcentage', col('total') / col('value'))
+------+-----+-----+-----------+
|bucket|value|total|pourcentage|
+------+-----+-----+-----------+
| A| 5| 9| 1.8|
| B| 8| 9| 1.125|
+------+-----+-----+-----------+
Upvotes: 0
Reputation: 31540
Use .crossJoin
then you will get the total from b
added to all rows of df a
, then you can calculate the percentage.
Example:
a.crossJoin(b).show()
#+------+----------+----------+
#|bucket|sum(total)|sum(total)|
#+------+----------+----------+
#| c| 4| 10|
#| b| 3| 10|
#| a| 3| 10|
#+------+----------+----------+
Instead of CrossJoin
you can try using window functions as mentioned below.
df.show()
#+-----+------+
#|total|bucket|
#+-----+------+
#| 1| a|
#| 2| a|
#| 3| b|
#| 4| c|
#+-----+------+
from pyspark.sql.functions import *
from pyspark.sql import *
from pyspark.sql.window import *
import sys
w=Window.partitionBy(col("bucket"))
w1=Window.orderBy(lit("1")).rowsBetween(-sys.maxsize,sys.maxsize)
df.withColumn("sum_b",sum(col("total")).over(w)).withColumn("sum_c",sum(col("total")).over(w1)).show()
#+-----+------+-----+-----+
#|total|bucket|sum_b|sum_c|
#+-----+------+-----+-----+
#| 4| c| 4| 10|
#| 3| b| 3| 10|
#| 1| a| 3| 10|
#| 2| a| 3| 10|
#+-----+------+-----+-----+
Upvotes: 3