Bulls

Reputation: 75

How to subset/create a new PySpark DF based on values from two columns

How can we filter the CID values from the following df, keeping only the rows whose combination of ID and TS appears more than once?

df =

ID  CID  TS
A   C1   t1
A   C2   t1
A   C3   t2
B   C4   t2

Output DF required

CID
C1
C2

Thank you.

Upvotes: 0

Views: 50

Answers (1)

mck

Reputation: 42352

You can count the rows in each partition of ID and TS with a window function, then keep only the rows whose partition count is greater than or equal to 2.

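from pyspark.sql import functions as F, Window

# Count the rows in each (ID, TS) partition, keep rows from partitions
# with at least 2 rows, and return only the CID column.
df2 = df.withColumn(
    'cnt',
    F.count('*').over(Window.partitionBy('ID', 'TS'))
).filter('cnt >= 2').select('CID')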

df2.show()
+---+
|CID|
+---+
| C1|
| C2|
+---+
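
To check the logic without a Spark session, here is a minimal plain-Python sketch of the same idea: count rows per (ID, TS) pair, then keep the CIDs from pairs that occur at least twice. The names `rows`, `pair_counts` and `cids` are made up for illustration and are not part of the PySpark API; this is not how Spark executes the window, only a mirror of the reasoning.

```python
from collections import Counter

# Sample rows from the question as (ID, CID, TS) tuples.
rows = [
    ("A", "C1", "t1"),
    ("A", "C2", "t1"),
    ("A", "C3", "t2"),
    ("B", "C4", "t2"),
]

# Count how many rows share each (ID, TS) pair. This plays the role of
# F.count('*').over(Window.partitionBy('ID', 'TS')) in the answer above.
pair_counts = Counter((id_, ts) for id_, _cid, ts in rows)

# Keep the CID of every row whose (ID, TS) pair occurs at least twice,
# which corresponds to .filter('cnt >= 2').select('CID').
cids = [cid for id_, cid, ts in rows if pair_counts[(id_, ts)] >= 2]

print(cids)  # ['C1', 'C2']
```

Only the (A, t1) pair occurs twice, so C3 and C4 are dropped, matching the output shown above.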

Upvotes: 1
