How to join two dataframes together

Question

I have two dataframes.

One comes from a groupBy and the other is the overall total:

a = data.groupBy("bucket").agg(sum("total"))
b = data.agg(sum("total"))

I want to add the total from b to dataframe a so that I can calculate each bucket's percentage of the total.

What kind of join should I use?

notNull · Accepted Answer

Use .crossJoin. Because b has a single row, the cross join appends the total from b to every row of a, and you can then calculate the percentage.

Example:

a.crossJoin(b).show()
#+------+----------+----------+
#|bucket|sum(total)|sum(total)|
#+------+----------+----------+
#|     c|         4|        10|
#|     b|         3|        10|
#|     a|         3|        10|
#+------+----------+----------+
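
To make the percentage step concrete, here is a minimal plain-Python sketch of what the cross join followed by a division computes. It does not use Spark; the rows mirror the example output above, and the names a_rows, b_rows, joined and pct are illustrative assumptions, not part of the original answer.

```python
from itertools import product

# Rows of `a` (one per bucket) and of `b` (a single row holding the grand total),
# mirroring the example output above.
a_rows = [("c", 4), ("b", 3), ("a", 3)]
b_rows = [(10,)]

# A cross join pairs every row of `a` with every row of `b`.
# Because `b` has exactly one row, the result has as many rows as `a`.
joined = [
    (bucket, bucket_sum, total)
    for (bucket, bucket_sum), (total,) in product(a_rows, b_rows)
]

# Percentage of the grand total contributed by each bucket.
pct = {bucket: 100.0 * bucket_sum / total for bucket, bucket_sum, total in joined}

for bucket, value in pct.items():
    print(bucket, value)
```

In Spark the corresponding step would be a withColumn on the joined dataframe that divides the bucket sum by the total. Both columns are named sum(total) in the output above, so aliasing them before the join is likely needed to avoid an ambiguous column reference.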

Instead of crossJoin, you can use window functions, as shown below.

df.show()
#+-----+------+
#|total|bucket|
#+-----+------+
#|    1|     a|
#|    2|     a|
#|    3|     b|
#|    4|     c|
#+-----+------+

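from pyspark.sql import Window
from pyspark.sql import functions as F  # avoids shadowing Python's built-in sum

# Per-bucket window: all rows that share the same bucket.
w = Window.partitionBy(F.col("bucket"))
# Whole-dataframe window: constant ordering with an unbounded frame.
# Note: with no partitionBy, Spark moves all rows into a single partition.
w1 = Window.orderBy(F.lit("1")).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df.withColumn("sum_b", F.sum(F.col("total")).over(w)) \
  .withColumn("sum_c", F.sum(F.col("total")).over(w1)) \
  .show()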
#+-----+------+-----+-----+
#|total|bucket|sum_b|sum_c|
#+-----+------+-----+-----+
#|    4|     c|    4|   10|
#|    3|     b|    3|   10|
#|    1|     a|    3|   10|
#|    2|     a|    3|   10|
#+-----+------+-----+-----+
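
The two window definitions can be modelled in plain Python to check the numbers in the table above. This is a sketch without Spark, using the same four input rows; sum_b and sum_c correspond to the columns above, while the final percentage value is an added illustrative column that the original answer does not compute.

```python
from collections import defaultdict

# Input rows (total, bucket), as in df.show() above.
rows = [(1, "a"), (2, "a"), (3, "b"), (4, "c")]

# Window partitioned by bucket: one sum per bucket (sum_b).
bucket_sums = defaultdict(int)
for total, bucket in rows:
    bucket_sums[bucket] += total

# Unbounded window over all rows: the grand total (sum_c).
grand_total = sum(total for total, _ in rows)

# Unlike groupBy, a window keeps every input row and attaches the aggregates to it.
# Each tuple is (total, bucket, sum_b, sum_c, percentage of the grand total).
result = [
    (total, bucket, bucket_sums[bucket], grand_total,
     100.0 * bucket_sums[bucket] / grand_total)
    for total, bucket in rows
]

for row in result:
    print(row)
```

Because the window version keeps one row per input row, bucket a appears twice with the same percentage; a groupBy followed by crossJoin yields one row per bucket instead.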
