Reputation: 1995
I am trying to perform an operation with groupBy() in PySpark, but I run into the following problem:
I have a dataframe (df1) with 3 attributes: attr1, attr2 and attr3. I want to apply a groupBy operation over that dataframe, taking into account only the attributes attr1 and attr2. Of course, when groupBy(attr1, attr2) is applied over df1 it generates groups of those instances that are equal to each other.
What I want to get is:
When I apply the groupBy() operation, I want to collect the groups whose instances are equal into one independent dataframe, and keep the instances that are not equal to any other in a second dataframe with all 3 attributes: attr1, attr2 and attr3 (the one not used to group by).
Is it possible?
Upvotes: 0
Views: 1941
Reputation: 4719
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as f

spark = SparkSession.builder.appName('MyApp').getOrCreate()
df = spark.createDataFrame(
    [('a', 'a', 3), ('a', 'c', 5), ('b', 'a', 4),
     ('c', 'a', 2), ('a', 'a', 9), ('b', 'a', 9)],
    ('attr1', 'attr2', 'attr3'))

# Attach to every row the number of rows sharing its (attr1, attr2) pair.
df = df.withColumn(
    'count',
    f.count('attr3').over(Window.partitionBy('attr1', 'attr2'))).cache()
output:
+-----+-----+-----+-----+
|attr1|attr2|attr3|count|
+-----+-----+-----+-----+
| b| a| 4| 2|
| b| a| 9| 2|
| a| c| 5| 1|
| c| a| 2| 1|
| a| a| 3| 2|
| a| a| 9| 2|
+-----+-----+-----+-----+
and
# Groups that contain more than one row, aggregated per (attr1, attr2):
an_independent_dataframe = df.filter(df['count'] > 1).groupBy('attr1', 'attr2').sum('attr3')
+-----+-----+----------+
|attr1|attr2|sum(attr3)|
+-----+-----+----------+
| b| a| 13|
| a| a| 12|
+-----+-----+----------+
# Rows whose (attr1, attr2) pair occurs only once, with all 3 attributes:
another_dataframe = df.filter(df['count'] == 1).select('attr1', "attr2", "attr3")
+-----+-----+-----+
|attr1|attr2|attr3|
+-----+-----+-----+
| a| c| 5|
| c| a| 2|
+-----+-----+-----+
Upvotes: 1