Reputation: 17824
For example, if we have the following dataframe:
df = spark.createDataFrame([['a', 1], ['a', 1],
                            ['b', 1], ['b', 2],
                            ['c', 2], ['c', 2], ['c', 2]],
                           ['col1', 'col2'])
+----+----+
|col1|col2|
+----+----+
| a| 1|
| a| 1|
| b| 1|
| b| 2|
| c| 2|
| c| 2|
| c| 2|
+----+----+
I want to mark groups based on col1 where values in col2 repeat themselves. My idea is to find the difference between the group size and the count of distinct values:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

window = Window.partitionBy('col1')
df.withColumn('col3', F.count('col2').over(window)) \
  .withColumn('col4', F.approx_count_distinct('col2').over(window)) \
  .select('col1', 'col2', (F.col('col3') - F.col('col4')).alias('col3')) \
  .show()
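Note that approx_count_distinct only returns an approximation. If an exact count is needed, one workaround is to take the size of a collected set over the same window, since a DISTINCT aggregate cannot be used directly inside a window function (a minimal sketch of the same idea):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

window = Window.partitionBy('col1')

# Exact variant: collect the distinct col2 values per group into a set
# and subtract its size from the group row count.
df.withColumn('col3',
              F.count('col2').over(window)
              - F.size(F.collect_set('col2').over(window))) \
  .select('col1', 'col2', 'col3') \
  .show()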
Maybe you have a better solution. My expected output:
+----+----+----+
|col1|col2|col3|
+----+----+----+
| a| 1| 1|
| a| 1| 1|
| b| 1| 0|
| b| 2| 0|
| c| 2| 2|
| c| 2| 2|
| c| 2| 2|
+----+----+----+
As you can see, all groups where col3 is equal to zero have only unique values in col2.
Upvotes: 0
Views: 493
Reputation: 4199
Given your requirement, you can compute the count over a window partitioned by both col1 and col2:
from pyspark.sql import functions as F

df = df.withColumn('col3', F.expr('count(*) over (partition by col1, col2) - 1'))
df.show(truncate=False)
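The same window can also be expressed with the DataFrame API instead of a SQL expression; a minimal equivalent sketch:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Each row gets the number of other rows sharing its (col1, col2) pair.
window = Window.partitionBy('col1', 'col2')
df = df.withColumn('col3', F.count(F.lit(1)).over(window) - 1)
df.show(truncate=False)
With the sample data this gives col3 = 1 for the a rows, 0 for the b rows, and 2 for the c rows, matching the expected output.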
Upvotes: 1