Reputation: 75
How do we filter the values of CID from the following df, keeping only the rows that share the same combination of ID and TS with at least one other row?
df =
ID | CID | TS |
---|---|---|
A | C1 | t1 |
A | C2 | t1 |
A | C3 | t2 |
B | C4 | t2 |
Required output DF:
CID |
---|
C1 |
C2 |
Thank you.
Upvotes: 0
Views: 50
Reputation: 42352
You can count the rows in each partition of ID and TS with a window function, then keep only the rows whose partition count is at least 2.
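from pyspark.sql import functions as F, Window
# Attach to every row the number of rows sharing its (ID, TS) pair
df2 = df.withColumn(
    'cnt',
    F.count('*').over(Window.partitionBy('ID', 'TS'))
# Keep rows whose pair occurs at least twice, then return only CID
).filter('cnt >= 2').select('CID')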
df2.show()
+---+
|CID|
+---+
| C1|
| C2|
+---+
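To make the logic of the window count easier to follow, here is a minimal sketch of the same idea in plain Python, without Spark. The `rows` list simply restates the question's table as tuples, and the names `rows`, `counts`, and `result` are illustrative and not part of the original code.

```python
from collections import Counter

# Rows from the question as (ID, CID, TS) tuples.
rows = [
    ("A", "C1", "t1"),
    ("A", "C2", "t1"),
    ("A", "C3", "t2"),
    ("B", "C4", "t2"),
]

# Count rows per (ID, TS) pair; this plays the role of
# count(*) over a window partitioned by ID and TS.
counts = Counter((id_, ts) for id_, _, ts in rows)

# Keep the CID of every row whose (ID, TS) pair occurs at least twice.
result = [cid for id_, cid, ts in rows if counts[(id_, ts)] >= 2]
print(result)  # ['C1', 'C2']
```

Only the pair (A, t1) occurs more than once, so C3 and C4 are dropped, which matches the Spark output above.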
Upvotes: 1