Mykola Zotko

Reputation: 17824

Get groups with duplicated values in PySpark

For example, if we have the following dataframe:

df = spark.createDataFrame([['a', 1], ['a', 1],
                            ['b', 1], ['b', 2],
                            ['c', 2], ['c', 2], ['c', 2]],
                            ['col1', 'col2'])

+----+----+
|col1|col2|
+----+----+
|   a|   1|
|   a|   1|
|   b|   1|
|   b|   2|
|   c|   2|
|   c|   2|
|   c|   2|
+----+----+

I want to mark the groups (defined by col1) in which values in col2 repeat. My idea is to take the difference between the group size and the count of distinct values:

from pyspark.sql import Window, functions as F

window = Window.partitionBy('col1')
df.withColumn('col3', F.count('col2').over(window)) \
  .withColumn('col4', F.approx_count_distinct('col2').over(window)) \
  .select('col1', 'col2', (F.col('col3') - F.col('col4')).alias('col3')) \
  .show()

Maybe you have a better solution. My expected output:

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   a|   1|   1|
|   a|   1|   1|
|   b|   1|   0|
|   b|   2|   0|
|   c|   2|   2|
|   c|   2|   2|
|   c|   2|   2|
+----+----+----+

As you can see, all groups where col3 equals zero contain only unique values in col2.
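Once the marker column is in place, the groups that contain duplicates can be pulled out with a simple filter. A minimal sketch of that follow-up, reusing window and F from the code above and assuming the result is assigned to a dataframe named result:

# Sketch: keep only rows from groups where col2 contains duplicates
# (col3 > 0 means the group size exceeds the distinct count).
result = df.withColumn('col3', F.count('col2').over(window)
                       - F.approx_count_distinct('col2').over(window))
result.filter(F.col('col3') > 0).show()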

Upvotes: 0

Views: 493

Answers (1)

过过招

Reputation: 4199

Based on your requirements, you can count rows over a window partitioned by both col1 and col2 and subtract one:

# Count rows per (col1, col2) pair; subtract 1 so values that occur only once get 0.
df = df.withColumn('col3', F.expr('count(*) over (partition by col1,col2) - 1'))
df.show(truncate=False)
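For reference, an equivalent formulation with the DataFrame window API (a sketch, assuming the same Window and functions as F imports as in the question) would be:

# Same logic via the Window API: count rows per (col1, col2) pair and
# subtract 1, so values that appear only once are marked with 0.
window = Window.partitionBy('col1', 'col2')
df = df.withColumn('col3', F.count('*').over(window) - 1)
df.show(truncate=False)

Either version should produce the expected output from the question; only the row order within partitions may differ.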

Upvotes: 1
