Reputation: 1995
I am trying to perform an operation with groupBy() in PySpark, but I run into the following problem:
I have a dataframe (df1) with 3 attributes: attr1, attr2 and attr3. I want to apply a groupBy operation over that dataframe, taking into account only the attributes attr1 and attr2. Of course, when groupBy(attr1, attr2) is applied over df1 it generates groups of those instances that are equal to each other.
What I want to get is:
When I apply the groupBy() operation, I want to collect the groups whose instances are equal into one independent dataframe, and keep the instances that are not equal to any other in a second dataframe with all 3 attributes: attr1, attr2 and attr3 (the one not used to group by).
Is it possible?
Upvotes: 0
Views: 1941
Reputation: 4719
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as f

spark = SparkSession.builder.appName('MyApp').getOrCreate()
df = spark.createDataFrame(
    [('a', 'a', 3), ('a', 'c', 5), ('b', 'a', 4),
     ('c', 'a', 2), ('a', 'a', 9), ('b', 'a', 9)],
    ('attr1', 'attr2', 'attr3'))

# Attach to every row the number of rows sharing its (attr1, attr2) pair.
df = df.withColumn(
    'count',
    f.count('attr3').over(Window.partitionBy('attr1', 'attr2'))).cache()
output:
+-----+-----+-----+-----+
|attr1|attr2|attr3|count|
+-----+-----+-----+-----+
| b| a| 4| 2|
| b| a| 9| 2|
| a| c| 5| 1|
| c| a| 2| 1|
| a| a| 3| 2|
| a| a| 9| 2|
+-----+-----+-----+-----+
and
# Groups that contain more than one row, aggregated per (attr1, attr2):
an_independent_dataframe = df.filter(df['count'] > 1).groupBy('attr1', 'attr2').sum('attr3')
+-----+-----+----------+
|attr1|attr2|sum(attr3)|
+-----+-----+----------+
| b| a| 13|
| a| a| 12|
+-----+-----+----------+
# Rows whose (attr1, attr2) pair occurs only once, with all 3 attributes:
another_dataframe = df.filter(df['count'] == 1).select('attr1', "attr2", "attr3")
+-----+-----+-----+
|attr1|attr2|attr3|
+-----+-----+-----+
| a| c| 5|
| c| a| 2|
+-----+-----+-----+
Upvotes: 1